Abstract: Recovery of the sparsity pattern (or support) of a
sparse vector from a small number of noisy linear samples is a 
common problem that arises in signal processing and statistics. 
In the high dimensional setting, it is known that recovery with 
a vanishing fraction of errors is impossible if the sampling rate 
and per-sample signal-to-noise ratio (SNR) are finite constants 
independent of the length of the vector. In this paper, it is 
shown that recovery with an arbitrarily small but constant 
fraction of errors is, however, possible, and that in some cases 
a computationally simple thresholding estimator is near-optimal. 
Upper bounds on the sampling rate needed to attain a desired 
fraction of errors are given in terms of the SNR and various key 
parameters of the unknown vector for two different estimators. 
The tightness of the bounds in a scaling sense, as a function of 
the SNR and the fraction of errors, is established by comparison 
with existing necessary bounds. Near optimality is shown for a 
wide variety of practically motivated signal models. 

Index Terms — compressed sensing, information-theoretic 
bounds, random matrices, random projections, regression, 
sparse approximation, sparsity, subset selection. 

I. Introduction 

Recovery of sparse or compressible signals from a limited
number of noisy linear projections is a problem that has
received considerable attention in signal processing and
statistics. Suppose, for instance, that a vector x of length n
is known to have exactly k nonzero elements, but the values
and locations of these elements are unknown and must be
estimated from a set of m noisy linear projections (or samples)
of the form

$$Y_i = \langle \phi_i, x \rangle + W_i \quad \text{for } i = 1, \dots, m \qquad (1)$$

where $\phi_i$ are known sampling vectors, $\langle \cdot, \cdot \rangle$ denotes the usual
Euclidean inner product, and $W_i$ is additive white Gaussian
noise. Then, a key insight from sparse signal recovery is
that the number of samples required for reliable estimation
depends primarily on the number of nonzero elements, and is
potentially much less than the length of the vector x.

One estimation problem of particular interest is to determine
which elements of the vector x are nonzero. This problem,
which is referred to as sparsity pattern recovery in this paper,
is known variously throughout the literature as support
recovery or model selection and has applications in compressed
sensing [1]-[4], sparse approximation [5], signal denoising [6],
subset selection in regression [7], and structure estimation in
graphical models [8].



A large body of recent work [8]-[20] has considered exact
recovery of the sparsity pattern by deriving necessary and
sufficient conditions on scalings of the tuple (n, k, m) to
ensure that the probability of exact recovery tends to one
as the vector length n becomes large. In particular, one line
of work has considered the fundamental limitations of the
recovery problem that apply to any possible estimator,
regardless of computational complexity. In the noiseless setting,
for instance, it has been shown that m = k + 1 samples
are necessary and sufficient for an NP-hard combinatorial
estimator [21], [22].

In the presence of noise, however, support recovery depends
critically on properties of the nonzero elements and cannot be
characterized solely in terms of the dimensions n and k. In this
setting, Wainwright [14] showed that $m = k + 1 + C \cdot k \log n$
samples are sufficient for an NP-hard combinatorial estimator
where C is a finite constant that depends on the per-sample
signal-to-noise ratio (SNR), the size of the smallest nonzero
element, and various properties of the sampling vectors.
Ensuing work by Fletcher, Rangan, and Goyal [18] and Wang,
Wainwright, and Ramchandran [19] showed that, for a
potentially different constant C, this scaling is also necessary for
any algorithm.

In conjunction with the fundamental limitations outlined
above, another line of work has studied scaling conditions for
computationally tractable algorithms. In the noiseless setting,
Donoho and Tanner [23] showed that $m = 2k \log(n/m)$
samples are sufficient for a polynomial-time linear program
known as Basis Pursuit [6]. In the noisy setting, Wainwright
[13] showed that $m = k + 1 + C_1 \cdot k \log n$ is sufficient for
a polynomial-time algorithm known as the Lasso [24], and
[18] showed that $m = k + 1 + C_2 \cdot k \log n$ is sufficient for a
thresholding estimator where $C_1$ and $C_2$ are finite constants
that depend on the SNR, the size of the smallest nonzero
element, and various properties of the sampling vectors.

Although the scaling conditions outlined above show that
exact recovery is possible with a relatively small number of
samples, there exist two important limitations. First, if there
is a non-vanishing fraction of nonzero elements, that is if
$k/n \to \Omega$ for some sparsity rate $\Omega > 0$, then these results
say that the ratio m/n must grow without bound with n in
order to overcome the effect of noise. This behavior is in
marked contrast to other recovery tasks, such as estimation of
the vector x with bounded mean squared error (MSE), which
require only $m > \rho \cdot n$ samples for some finite sampling rate $\rho$
(see [25], [26]). If noise is due to quantization, this means that
accurate estimation with respect to MSE requires only a fixed
bit-rate whereas exact recovery of the sparsity pattern requires
an unbounded bit-rate.

The second limitation is that scaling results in terms of the 
dimensions (n, k, m) do not tell the whole story. Often, one 
needs to know the exact constants involved in the bounds, or at 
least the dependence of these constants on parameters such as 
the SNR or various assumptions about the vector x. For many 
of the estimation tasks considered throughout the compressed 
sensing literature, these properties are not well understood. 
As a result, the majority of sufficient conditions are far more 
conservative than those suggested by empirical evidence, and 
the optimality (or gap from optimality) of existing algorithms 
is difficult to determine due to the potential looseness of the 
necessary conditions. 

A. Outline of Main results 

In the present work, we derive upper bounds on the number
of samples needed for approximate recovery of the sparsity
pattern in the high dimensional setting when there exists a
non-vanishing fraction of nonzero elements. In particular, we
consider two different estimation algorithms, the NP-hard
combinatorial optimization algorithm studied by Wainwright
and the computationally efficient thresholding algorithm
studied by Fletcher et al., and characterize the sampling rate
$\rho = m/n$ needed to ensure that the number of errors in the
estimated sparsity pattern does not exceed $\alpha \cdot k$ for some
error rate $\alpha$. Corresponding lower bounds, which apply to
any estimator, are derived in the companion paper [20]. An
example of these bounds is shown in Figure 1.

Fig. 1. Bounds on the asymptotic sampling rate $\rho = m/n$ and SNR required
to identify the locations of at least 90% of the nonzero elements of a vector
$x \in \mathbb{R}^n$ with sparsity rate $\Omega = k/n = 10^{-4}$ when the power of the smallest
nonzero element is at least 20% of the average power of the nonzero elements.

The contributions of this paper directly address the
limitations of the scaling results for exact recovery outlined above.
With respect to the first limitation, we show that any error
fraction $\alpha > 0$ can be achieved using a finite sampling rate $\rho$.
In other words, approximate sparsity pattern recovery has the
same scaling behavior as estimation of x with bounded mean
squared error. If noise is due to quantization, this means that
a fixed bit-rate is sufficient for approximate sparsity pattern
recovery.

With respect to the second limitation, our lower bounds are 
derived with an explicit dependence on various key problem 
parameters such as the SNR, the sparsity rate, and the relative 
size of the smallest nonzero elements. These bounds allow 
us to consider a wide variety of problem settings where the 
unknown vectors may be deterministic or stochastic and the 
magnitude of the smallest nonzero element may tend to zero 
as the vector length becomes large. Our framework allows us 
to address a number of important questions: 

• What is the tradeoff between the sampling rate and the 
SNR? Our bounds show that there are two fundamentally 
different settings. If the SNR is small relative to the 
desired distortion then the number of samples needed is 
inversely proportional to the SNR. However, if the SNR is 
large relative to the desired distortion, then the number of 
samples needed is inversely proportional to the logarithm 
of the SNR. 

• What is the tradeoff between optimality and computational
complexity? As illustrated in Figure 1, our bounds
show that a computationally simple thresholding estimator
is near-optimal in the low-SNR setting. In the high-SNR
setting, however, only the computationally hard estimator
is shown to attain near-optimal performance. These
results suggest that significant computational challenges
arise in the high-SNR setting where the difficulty of
estimation is due primarily to the uncertainty about the
nonzero values.

• What happens as the desired error rate tends to zero?
Our bounds show that the sampling rate depends on
the inverse of the error rate $1/\alpha$. If the magnitudes of
the nonzero elements have a fixed lower bound that is
independent of n then this dependence is logarithmic.
Otherwise, the dependence is polynomial.

• What is the effect of prior information? The upper bounds
in this paper correspond to estimators that know the
approximate number of nonzero elements, but have no
prior information about their values. The lower bounds
in the companion paper [20] apply to settings where the
estimator may know statistical information such as the
average power, range of values, or distribution.
Interestingly, the resulting bounds show that in many cases this
additional knowledge does not significantly improve the
ability to estimate the sparsity pattern.

Beyond these results, our framework also permits us to 
prove some further insights. For instance, we show that the 
sampling rate distortion function is a convex function and that, 
in certain settings, i.i.d. sampling matrices are asymptotically 
strictly suboptimal. 

This paper is organized as follows: Section II provides
a precise problem formulation. Sections III and IV provide
achievable bounds for two different estimators. Section V
provides improved bounds using a particular set of sampling
matrices that we will refer to as "rate-sharing" matrices.
Section VI analyzes the scaling behavior of these bounds
with respect to various key properties. Section VII presents
specific examples and illustrations, and proofs are given in
the Appendices. The following section provides a brief, and
necessarily incomplete, overview of work related to this paper.

B. Related Work 

One line of related research [1]-[4], [6], [23]-[34] has
focused on the approximation of sparse vectors with respect
to the $\ell_2$ norm. From a scaling perspective, one particularly
important result from this literature [25], [26] is that any
k-sparse vector x of length n can be approximated with
bounded mean squared error $\|\hat{x} - x\|^2 / n < C_1/\mathrm{SNR}$ using
$m = \lceil C_2 \cdot k \log(n/k) \rceil$ samples and a quadratic program
known as Basis Pursuit [6] where $C_1, C_2$ are finite constants.
In the absence of any noise, this result provides sufficient
conditions for exact recovery of x, which in turn implies exact
sparsity pattern recovery. In the presence of noise, however,
it is important to observe that bounds on the mean squared
error are insufficient to determine the accuracy of the estimated
sparsity pattern.

Another line of related research [35]-[40] has focused
on the rate-distortion behavior of sparse sources from an
information-theoretic perspective. While these works provide
valuable insights into the tradeoff between the SNR and the
accuracy of approximation of x in the $\ell_2$ sense, they do not
directly address the problem of sparsity pattern recovery.

Most closely related to the work in this paper is work
which has focused directly on sparsity pattern recovery in the
presence of noise [8]-[20]. As discussed in the introduction, it
is now well understood that $m = k + 1 + C \cdot k \log n$ samples are
both necessary and sufficient for exact recovery when the SNR
is finite and there exists a fixed lower bound on the magnitude
of the smallest nonzero elements.

The problem of approximate support recovery with a
nonzero error rate $\alpha$ has also been considered in recent work.
For the special case where the values of the nonzero elements
are known, Aeron, Zhao, and Saligrama [11], [12] showed
that $m = C \cdot k \log(n/k)$ samples are necessary and sufficient
where the constant C is given explicitly in terms of the error
rate $\alpha$, the SNR, and nonzero values. In the more general
setting where the nonzero values are unknown, the necessary
and sufficient condition $m = C \cdot k \log(n/k)$ was derived
independently by Akcakaya and Tarokh [15] and the authors
of this paper [16], [17], [20] for the special case where
$k/n \to \Omega \in (0, 1)$. While the work of [15] provides bounds
for a variety of scalings, actual upper and lower bounds on
the constant C are not explicitly stated. By contrast, the upper
bounds in this paper, in conjunction with the lower bounds
in the companion paper [20], provide a tight characterization
of the constant C for the setting of linear sparsity and show
how this constant depends on various key properties such as
the SNR, the size of the $\alpha k$'th smallest nonzero element, and
the distortion $\alpha$. Furthermore, while the upper bounds of [15]
correspond to a computationally hard joint typicality decoding
strategy, which requires knowledge of both the sparsity k as
well as the SNR, we show the same scaling can be achieved
using a computationally tractable thresholding estimator which
depends only on the sparsity k.



II. Problem Formulation 

In this paper, we assume that x is an arbitrary (non-random)
element from some subset $\mathcal{X}_n \subseteq \mathbb{R}^n$. The sparsity pattern
$s \subseteq S_n = \{1, 2, \dots, n\}$ is the set of integers indexing the
nonzero elements of x,

$$s := \{i : x_i \neq 0\},$$

and the sparsity $k = |s|$ is the number of nonzero elements.
We denote by $S_n^k$ the set of all subsets of $S_n$ of cardinality
k. For simplicity, we assume that the sparsity k is known; in
Section V we show that results obtained using this assumption
can be extended to settings with only approximate knowledge
about k.

We assume that x is sampled using the noisy linear
observation model given in (1). In matrix form, the vector of samples
$Y \in \mathbb{R}^m$ can be expressed as

$$Y = Ax + W$$

where $A \in \mathbb{R}^{m \times n}$ is a known sampling matrix with rows equal
to $\phi_i^T$ and $W \in \mathbb{R}^m$ is a standard Gaussian vector. We further
assume that an estimator (or recovery algorithm) is given the
set (Y, A, k), and the goal is to recover the sparsity pattern s
of x.
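As a concrete illustration of the observation model, the following Python sketch draws samples Y = Ax + W for an i.i.d. Gaussian sampling matrix. The use of NumPy, the entry variance 1/n (chosen so that each row of A has unit magnitude on average, in line with the normalization E[tr(AA^T)] = m adopted later in this section), and all names and dimensions are assumptions made for illustration only.

```python
import numpy as np

def sample_observations(x, m, rng):
    """Draw Y = A x + W with A i.i.d. N(0, 1/n) and W standard Gaussian."""
    n = x.shape[0]
    # Variance 1/n per entry gives E[tr(A A^T)] = m (unit-norm rows on average).
    A = rng.normal(0.0, 1.0 / np.sqrt(n), size=(m, n))
    W = rng.normal(0.0, 1.0, size=m)
    return A @ x + W, A

rng = np.random.default_rng(0)
n, k, m = 2000, 200, 1000            # illustrative dimensions (assumed)

# Build a k-sparse vector with a random support and Gaussian nonzero values.
x = np.zeros(n)
support = rng.choice(n, size=k, replace=False)
x[support] = rng.normal(0.0, 2.0, size=k)

Y, A = sample_observations(x, m, rng)
trace_ratio = np.sum(A ** 2) / m     # tr(A A^T) / m, expected to be near 1
snr = np.sum((A @ x) ** 2) / m       # empirical ||Ax||^2 divided by E||W||^2 = m
```

With this normalization the empirical SNR is close to the average power per dimension of x, independent of the number of samples m.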

To quantify the distortion between a sparsity pattern s
and its estimate $\hat{s}$, it is important to observe that there are
two different error events: one type of error occurs when an
element in s is omitted from the estimate $\hat{s}$ and the other
occurs when an element not present in s is included in $\hat{s}$. In
this paper, we use the distortion function $d : S_n \times S_n \to [0, 1]$
defined by

$$d(s, \hat{s}) := 1 - \frac{|s \cap \hat{s}|}{\max(|s|, |\hat{s}|)} \qquad (2)$$

which corresponds to the maximum of the two types of errors.
It can be verified that this distortion function is a metric on $S_n$.
We say that recovery is successful with respect to distortion
$\alpha \in [0, 1]$ if $d(s, \hat{s}) \le \alpha$. Exact recovery corresponds to the
case $\alpha = 0$.
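The distortion measure can be made concrete with a few lines of Python. This is a minimal sketch that assumes the distortion in (2) is one minus the size of the overlap between the two patterns divided by the size of the larger pattern; the function name and the convention for two empty patterns are our own choices for illustration.

```python
def distortion(s, s_hat):
    """Distortion between a sparsity pattern and its estimate.

    Assumed form: 1 - |s & s_hat| / max(|s|, |s_hat|).
    """
    s, s_hat = set(s), set(s_hat)
    largest = max(len(s), len(s_hat))
    if largest == 0:
        return 0.0  # convention (assumed): two empty patterns agree exactly
    return 1.0 - len(s & s_hat) / largest

# One of four true elements is missed and one false element is included.
d_example = distortion({1, 2, 3, 4}, {1, 2, 3, 5})
```

When both patterns have the same size k, as for the estimators studied in this paper, the distortion is simply the fraction of the k true elements that are missed.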

We are interested in performance guarantees that hold
uniformly for any $x \in \mathcal{X}_n$. It is important to note, however,
that for any particular sampling matrix A, there may exist a
degenerate subset of $\mathcal{X}_n$ for which recovery is particularly
difficult. To overcome the effects of these sets, we allow A
to be a random matrix (denoted using boldface) distributed
independently of x. Given any sparsity pattern estimator
$\hat{s} : \mathbb{R}^m \times \mathbb{R}^{m \times n} \times \mathbb{N} \to S_n$, the probability of error corresponds
to the worst case $x \in \mathcal{X}_n$ with respect to the distribution on
A,

$$P_e^{(n)} = \sup_{x \in \mathcal{X}_n} \Pr\{ d(s, \hat{s}(Y, A, k)) > \alpha \}.$$

Estimation in the presence of noise depends critically on 
the size of the entries in the sampling matrix. In this paper, 
we assume that 

$$E[\mathrm{tr}(A A^T)] = m. \qquad (3)$$

This scaling is consistent with the related work O, JH, lfl2l . 
|[T6l . IfTTI and corresponds to the setting where each sampling 
vector (i.e. row of A) has unit magnitude. Thus, one useful
property of this scaling is that the SNR of the linear samples
given in (1) can be compared directly with that of classical
samples of the form $Y_i = x_i + W_i$. Another useful property is
that the SNR does not depend on the number of samples m.

We caution the reader that various other scalings of the
sampling matrix are also used in the literature, and thus extra
care is needed when comparing results. For instance, in [13],
[14], [19] each element of A has unit power, and the squared
magnitude of each sampling vector is thus proportional to the
vector length n.

To characterize the number of samples that are needed, we
consider the high dimensional setting where the vector length
n becomes large. We use $\mathcal{X}$ to denote a sequence of subsets
$\{\mathcal{X}_n : n \in \mathbb{N}\}$ and refer to $\mathcal{X}$ as a vector source. The main
question we address is whether or not recovery is possible
when the number of samples is given by $m_n = \lceil \rho \cdot n \rceil$ for some
finite sampling rate $\rho$ that is a fixed constant independent of n.
We use the notation $\hat{s}$ interchangeably to denote an estimate
of the sparsity pattern s, or a family of estimators $\{\hat{s}_{n,m} :
n, m \in \mathbb{N}\}$.

Definition 1. A sampling rate distortion pair $(\rho, \alpha)$ is said to
be achievable for a vector source $\mathcal{X}$ if for each integer n there
exists an estimator $\hat{s}$ and $\lceil \rho \cdot n \rceil \times n$ sampling matrix A such
that

$$P_e^{(n)} \to 0 \quad \text{as } n \to \infty.$$

The sampling rate distortion function $\rho(\alpha)$ of a vector source
$\mathcal{X}$ is the infimum of rates $\rho \ge 0$ such that the pair $(\rho, \alpha)$ is
achievable. The operational sampling rate distortion function
$\rho^{(\hat{s})}(\alpha)$ of an estimator $\hat{s}$ is the infimum of rates that are
achievable using $\hat{s}$.

We focus exclusively on the scaling regime where the 
sparsity k scales linearly with the vector length n. 

Definition 2. Given any sparsity rate $\Omega \in (0, 1/2)$, the vector
source $\mathcal{X}(\Omega)$ is the set of all sequences $\{x^{(n)} \in \mathbb{R}^n : n \in \mathbb{N}\}$
for which $k_n / n \to \Omega$ as $n \to \infty$.

From a sampling perspective, the sparsity rate $\Omega$ measures
the degrees of freedom per dimension of x and is analogous
to the rate of innovation BP or "bandwidth" of an infinite 
length discrete time sequence. 

One limitation of the general source $\mathcal{X}(\Omega)$ is that the
nonzero values may be arbitrarily small, thus making recovery
in the presence of noise impossible. In previous work fPH . 
04), this issue is addressed by placing a lower bound on 
the magnitudes of the nonzero elements of x. This paper
uses the more general approach where the nonzero elements
are characterized by a distribution F or set of distribution
functions $\mathcal{F}$. For any $x \in \mathbb{R}^n$ and sparsity pattern s we define

$$F_n(t) := \frac{1}{|s|} \sum_{i \in s} \mathbf{1}(x_i \le t)$$

to be the empirical distribution of $\{x_i : i \in s\}$.

Definition 3. Given any sparsity rate $\Omega \in (0, 1/2)$ and
distribution F with finite second moment and zero probability
mass at zero, the vector source $\mathcal{X}(\Omega, F)$ is the set of all
sequences $\{x^{(n)} \in \mathbb{R}^n : n \in \mathbb{N}\}$ for which $k_n / n \to \Omega$
and $\|F_n - F\|_\infty \to 0$ as $n \to \infty$. Given any set of
distributions $\mathcal{F}$, the vector source $\mathcal{X}(\Omega, \mathcal{F})$ consists of the
union $\bigcup_{F \in \mathcal{F}} \mathcal{X}(\Omega, F)$.

To be consistent with previous work, we may for example
consider the source $\mathcal{X}(\Omega, \mathcal{F})$ where $\mathcal{F}$ denotes the set of
all distributions whose support is bounded away from zero.
However, one advantage of our approach is that we may also
consider a source $\mathcal{X}(\Omega, F)$ where F has a density around
zero, and thus a small number of nonzero elements may be
arbitrarily small.

III. Bounds for Combinatorial Optimization 

To understand the fundamental tradeoffs involved in sparsity 
pattern recovery, it is useful to consider the performance that 
can be achieved without any constraints on computational 
complexity. In this section, we study the behavior of a 
particular estimator which uses no prior information about 
the vector x other than the number k of nonzero elements, 
but requires solving an NP-hard combinatorial optimization 
problem. This estimator, which was used by Wainwright
[14] to give fundamental scaling results for exact recovery,
corresponds to the maximum likelihood estimate of the vector
x or equivalently the least squares estimate over the $\ell_0$ ball of
size k. In this paper, we refer to this estimator as the nearest
subspace estimator.

Definition 4 (Nearest Subspace Estimator). For a given set
(y, A, k), the nearest subspace (NS) sparsity pattern estimate
$\hat{s}^{\mathrm{NS}}$ is selected uniformly at random from the set

$$\arg\min_{s \in S_n^k} \|\Pi(A_s) y\| \qquad (4)$$

where $\Pi(A_s)$ denotes the $m \times m$ orthogonal projection matrix
onto the null space of the $m \times k$ submatrix $A_s$.
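For very small problems, the nearest subspace estimate can be computed by exhaustive search, which makes the definition easy to experiment with. The sketch below is an illustration under stated assumptions rather than a practical algorithm: it uses NumPy, it computes the projection residual as the residual of a least-squares fit onto the k columns of A indexed by each candidate pattern, and the function name, the tie tolerance, and the toy dimensions are hypothetical. Its cost grows as n choose k.

```python
import itertools
import numpy as np

def nearest_subspace(y, A, k, rng, tol=1e-12):
    """Brute-force NS estimate: minimize the least-squares residual over all k-subsets."""
    n = A.shape[1]
    residuals = {}
    for s in itertools.combinations(range(n), k):
        A_s = A[:, list(s)]
        coef, *_ = np.linalg.lstsq(A_s, y, rcond=None)
        # Norm of the part of y orthogonal to the column space of A_s.
        residuals[s] = np.linalg.norm(y - A_s @ coef)
    best = min(residuals.values())
    ties = [s for s, r in residuals.items() if r <= best + tol]
    # Select uniformly at random among (numerically) tied minimizers.
    return set(ties[rng.integers(len(ties))])

rng = np.random.default_rng(1)
n, k, m = 8, 2, 5                          # toy dimensions (assumed)
A = rng.normal(0.0, 1.0 / np.sqrt(n), size=(m, n))
x = np.zeros(n)
x[[1, 5]] = [3.0, -4.0]
y = A @ x + 0.01 * rng.normal(size=m)      # high-SNR toy example
s_hat = nearest_subspace(y, A, k, rng)
```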

We remark that in many cases, the minimizer of (4) is
unique and the nearest subspace estimate is a deterministic
function of its inputs. In this section, for example, we derive
bounds by considering distributions on the sampling matrix
A that guarantee uniqueness almost surely. In other important
cases however, such as those considered in Section V, the
minimizing set contains multiple sparsity patterns and
performance guarantees require randomness.

Also, we remark that the nearest subspace estimator does
not use any information about the nonzero values of x. On
one hand, this means that the performance of the nearest subspace
estimator for a source $\mathcal{X}(\Omega, F)$ may be suboptimal in general.
(For the special case of Gaussian sources, a connection to
optimal estimation in the high SNR setting is discussed in
Section VII.) On the other hand, this means that the
operational sampling rate distortion function of a general source
$\mathcal{X}(\Omega, \mathcal{F})$ corresponds directly to the worst case $F \in \mathcal{F}$, a
useful property that is not necessarily true for estimators that
depend on F.

Before we state our main results, we define the following
key properties of the vector source $\mathcal{X}(\Omega, F)$.
Definition 5. The power of a vector source $\mathcal{X}(\Omega, F)$ is given
by

$$P(\Omega, F) = \Omega\,(\mu_F^2 + \sigma_F^2) \qquad (5)$$

where $\mu_F$ and $\sigma_F^2$ are the mean and variance of F.

Due to the scaling of the sampling matrix given by (3), the
power $P(\Omega, F)$ represents the SNR of the samples, that is

$$P(\Omega, F) = \lim_{n \to \infty} \frac{E\|Ax\|^2}{E\|W\|^2}.$$

Definition 6. For any $0 < \beta \le 1$, the $\beta$-truncated distribution
$F_\beta$ of a distribution F is defined as

$$F_\beta(x) := \Pr\{X \le x \mid Z = 0\} \qquad (6)$$

where X has distribution F and $Z \in \{0, 1\}$ obeys

$$\Pr\{Z = 1 \mid X\} = \begin{cases} 1, & \text{if } |X| > t_\beta \\ p_\beta, & \text{if } |X| = t_\beta \\ 0, & \text{if } |X| < t_\beta \end{cases}$$

with $t_\beta$ and $p_\beta$ chosen such that $\Pr\{Z = 0\} = \beta$.

The $\beta$-truncated distribution $F_\beta$ characterizes the
distribution of the smallest (in magnitude) $\beta k$ nonzero elements of x.
For instance, if F(x) has a nonzero density that is flat in a
neighborhood around x = 0 then $F_\beta$ converges to a uniform
distribution as $\beta \to 0$.
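For a finite vector, the truncated distribution and its power can be approximated empirically. The following Python sketch keeps the floor(beta * k) nonzero values of smallest magnitude and evaluates the power as the rate times the second moment (mean squared plus variance). It is a minimal illustration under these assumptions: it ignores the tie-breaking probability at the threshold, and the function names and example values are hypothetical.

```python
import math

def truncated_sample(nonzeros, beta):
    """Return the floor(beta * k) nonzero values of smallest magnitude (at least one)."""
    count = max(1, math.floor(beta * len(nonzeros)))
    return sorted(nonzeros, key=abs)[:count]

def power(values, rate):
    """Rate times the empirical second moment, i.e. rate * (mean^2 + variance)."""
    return rate * sum(v * v for v in values) / len(values)

nonzeros = [0.5, -1.0, 2.0, -4.0]   # illustrative nonzero values (assumed)
omega, beta = 0.1, 0.5              # illustrative sparsity rate and truncation level

smallest = truncated_sample(nonzeros, beta)
truncated_power = power(smallest, beta * omega)   # power at rate beta * omega
full_power = power(nonzeros, omega)               # power of the full source
```

In this example the truncated power is much smaller than the full power, which reflects the fact that the smallest nonzero elements carry little of the signal energy.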

Using the above definitions, we are able to state our main
result which is an upper bound on the sampling rate distortion
function $\rho(\alpha)$. The proof is given in Appendix A.

Theorem 1. The operational sampling rate distortion function
$\rho^{\mathrm{NS}}(\alpha)$ of the vector source $\mathcal{X}(\Omega, F)$ corresponding to the
nearest subspace estimator is upper bounded by

$$\rho^{\text{NS-UB}}(\alpha) = \Omega + \max_{\alpha \le \beta \le 1} \frac{\Omega H(\beta) + (1 - \Omega)\, H\!\left(\frac{\beta \Omega}{1 - \Omega}\right)}{C(1 + P(\beta \Omega, F_\beta))} \qquad (7)$$

where $H(x) = -x \log(x) - (1 - x) \log(1 - x)$ denotes binary
entropy,

$$C(x) = \frac{1}{2}\left[\log(x) - \frac{x - 1}{x}\right], \qquad (8)$$

and $P(\cdot, \cdot)$ and $F_\beta$ are defined by (5) and (6) respectively.
Moreover, if the sampling matrix is i.i.d. Gaussian, then for
any sampling rate $\rho > \rho^{\text{NS-UB}}(\alpha)$, there exists a constant C > 0
such that the probability of error obeys

$$P_e^{(n)} \le \exp(-C \cdot n).$$

One immediate consequence of Theorem 1 is that any
distortion $\alpha > 0$ can be achieved using a finite sampling rate $\rho$.
With respect to scalings of (n, k, m), this particular result has
been shown in an earlier version of this work [16], by Akcakaya and
Tarokh [15] in terms of unspecified constants, and by Aeron
et al. [11], [12] for discrete-valued vectors.

A key difference with respect to previous work, however,
is that Theorem 1 provides an explicit upper bound on the
value of the sampling rate generally for any distribution F.
This characterization makes it possible to understand how the
fundamental sampling rate distortion function $\rho(\alpha)$ depends
on the distortion $\alpha$, the SNR, or other key properties of the
source.

For instance, the first term on the right hand side of (7)
corresponds to the noiseless sampling rate distortion function
of the general source $\mathcal{X}(\Omega)$ when the sampling matrix is
constrained to have i.i.d. elements (see [20, Proposition 1]).
The second term corresponds to the additional sampling rate
needed to overcome the noise. At high SNR, this term is
inversely proportional to the logarithm of the SNR. Further
analysis of the bound in Theorem 1 with respect to scalings
of $\alpha$ and the SNR is provided in Section VI.

IV. Bounds for Thresholding 

One question of practical importance is whether there exist 
computationally efficient estimators whose recovery perfor- 
mance is comparable to that of computationally unconstrained 
estimators such as the nearest subspace estimator studied in the 
previous section. In this section, we bound the sampling rate 
distortion function of a particular thresholding-style estimator 
and show that in some cases, near-optimal performance can 
be attained. The estimator we study was first introduced by
Fletcher et al. [18], under the name maximum correlation, for
the study of exact sparsity pattern recovery. In this paper, we
refer to it as the thresholding estimator for reasons that will 
become clear shortly. 

Definition 7 (Thresholding Estimator). For a given set
(y, A, k), the thresholding (TH) sparsity pattern estimate $\hat{s}^{\mathrm{TH}}$
is selected uniformly at random from the set

$$\arg\max_{s \in S_n^k} \|A_s^T y\| \qquad (9)$$

Although the optimization in (9) appears to be similar to
the optimization problem in the nearest subspace estimate (4),
the key difference is that the above problem can be solved
efficiently by identifying the k largest elements (in magnitude)
of the n-dimensional vector $z = A^T y$. If the maximizer of (9)
is unique, then the thresholding estimate can be expressed
equivalently as

$$\hat{s}^{\mathrm{TH}} = \{ i \in \{1, 2, \dots, n\} : |z_i| \ge t_k(z) \}$$

where $t_k(z)$ is the magnitude of the k'th largest nonzero
element of z. A connection between the thresholding estimator
and optimal estimation for Gaussian sources at low SNR is
discussed further in Section VII.
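The thresholding estimate amounts to a matched-filter step followed by a top-k selection, as the following Python sketch illustrates. It is a minimal example assuming NumPy; the function name is hypothetical, and ties are broken deterministically by index rather than uniformly at random as in the definition, which only matters when several entries of z have exactly the same magnitude.

```python
import numpy as np

def thresholding(y, A, k):
    """Return the indices of the k largest-magnitude entries of z = A^T y."""
    z = A.T @ y
    # Stable sort on -|z| keeps the lowest index first among exact ties.
    order = np.argsort(-np.abs(z), kind="stable")
    return set(order[:k].tolist())

# Toy check with an identity sampling matrix, so that z = y.
A_toy = np.eye(4)
y_toy = np.array([0.1, -3.0, 2.0, 0.5])
s_hat_toy = thresholding(y_toy, A_toy, k=2)
```

The cost is dominated by the matrix-vector product and the sort, in contrast to the exhaustive search over all k-subsets required by the nearest subspace estimator.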

We now give our main result which is an exact
characterization of the sampling rate distortion function $\rho(\alpha)$. The proof
is given in Appendix B.

Theorem 2. The operational sampling rate distortion function
$\rho^{\mathrm{TH}}(\alpha)$ of the vector source $\mathcal{X}(\Omega, F)$ corresponding to
the thresholding estimator is upper bounded by the solution
$\rho^{\text{TH-UB}}(\alpha)$ to

$$\int G\!\left( \frac{\rho^{\text{TH-UB}}(\alpha)\, x^2}{1 + P(\Omega, F)},\; Q^{-1}\!\left( \frac{\alpha \Omega}{2(1 - \Omega)} \right) \right) dF(x) = \alpha \qquad (10)$$

if $\alpha < 1 - \Omega$ and is zero otherwise, where $G(p^2, t) = 1 -
Q(t + p) - Q(t - p)$, $Q(x) = \int_x^\infty \frac{1}{\sqrt{2\pi}} \exp(-u^2/2)\, du$, and
$P(\cdot, \cdot)$ is defined by (5). Moreover, if the sampling matrix is
i.i.d. Gaussian, then $\rho^{\mathrm{TH}}(\alpha) = \rho^{\text{TH-UB}}(\alpha)$.
In the special case where the distribution F is a zero-mean
Gaussian, the solution to (10) can be expressed explicitly.

Corollary 1. If F is a zero-mean Gaussian distribution, then
the upper bound on the operational sampling rate distortion
function of the thresholding estimator given in Theorem 2 is
given by

$$\rho^{\text{TH-UB}}(\alpha) = \Omega\, \frac{1 + P(\Omega, F)}{P(\Omega, F)} \left( \left[ \frac{Q^{-1}\!\left( \frac{\alpha \Omega}{2(1 - \Omega)} \right)}{Q^{-1}\!\left( \frac{1 - \alpha}{2} \right)} \right]^2 - 1 \right) \qquad (11)$$
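The Gaussian case is simple enough to evaluate numerically. The sketch below is an illustration under an explicit assumption about the form of the closed-form expression: it takes the bound to be Omega * (1 + P) / P * ((t / u)^2 - 1), where t = Q^{-1}(alpha * Omega / (2 (1 - Omega))) is the detection threshold, u = Q^{-1}((1 - alpha) / 2), and the bound is zero when alpha >= 1 - Omega. The function names are hypothetical and only the Python standard library is used.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def q_inv(p):
    """Inverse of the Gaussian tail function Q, for 0 < p < 1."""
    return -_STD_NORMAL.inv_cdf(p)

def rho_th_gaussian(alpha, omega, snr):
    """Assumed closed form of the thresholding bound for zero-mean Gaussian F."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    if alpha >= 1.0 - omega:
        return 0.0
    t = q_inv(alpha * omega / (2.0 * (1.0 - omega)))  # threshold set by false alarms
    u = q_inv((1.0 - alpha) / 2.0)                    # quantile set by missed detections
    return omega * (1.0 + snr) / snr * ((t / u) ** 2 - 1.0)

# Illustrative values (assumed): sparsity rate 1e-4 and per-sample SNR of 10.
rate_5pct = rho_th_gaussian(0.05, 1e-4, 10.0)
rate_10pct = rho_th_gaussian(0.10, 1e-4, 10.0)
```

Under this reading, the required sampling rate decreases as the allowed distortion grows and as the SNR increases, and it vanishes once alpha reaches 1 - Omega.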



More generally, the solution to (10) can be upper bounded
for any distribution F using the following result which
depends only on the average power and the $\beta$-truncated
distribution of F. The proof is given in Appendix B.

Corollary 2. The upper bound on the operational sampling
rate distortion function of the thresholding estimator given in
Theorem 2 obeys



TH-UB 



(a) < Q 



1 + P(Q,F) 



' 1 (f) + ' 



2(l-fi) 



(12) 



where $F_\beta$ is defined by (6).

A key contribution of Theorem 2 is that any distortion $\alpha > 0$
can be achieved using a finite sampling rate and a
computationally efficient estimator. This means that, with respect
to scalings of the dimensions (n, k, m), the performance of
the thresholding estimator is equivalent to that of the nearest
subspace estimator.

A further contribution of Theorem 2 is that in many cases,
the sampling rate distortion function $\rho^{\mathrm{TH}}(\alpha)$ is significantly
less than the upper bound for the nearest subspace estimator
$\rho^{\text{NS-UB}}(\alpha)$ given in Theorem 1 and thus provides an improved
upper bound on the fundamental sampling rate distortion
function $\rho(\alpha)$. In Section VI we use this improved upper
bound to characterize the rate at which $\rho(\alpha)$ tends to infinity
as $\alpha$ tends to zero and the rate at which $\rho(\alpha)$ tends to infinity
as the SNR tends to zero.

At a high level, our proof of Theorem 2 is similar to the
proof used by Fletcher et al. [18] for exact recovery in the
sense that both proofs depend on the asymptotic behavior
of the vector $A^T y$. Technically, however, there is a key
difference: whereas exact recovery depends on the extreme
order statistics, which can be controlled using a union bound,
approximate recovery depends on the limiting empirical
distribution. Since the elements of $A^T y$ are not independent, the
main challenge in our proof is showing convergence of the
empirical distribution. We use standard truncation arguments
as well as Pinsker's inequality and manipulation of various
mutual informations.



V. Rate-Sharing Sampling Matrices 

The upper bounds on the sampling rate distortion function
$\rho(\alpha)$ in Theorems 1 and 2 were derived by analyzing the
special case where the elements of A are i.i.d. zero-mean
Gaussian. In this section, we show that any upper bound on the
operational sampling rate distortion function $\rho^{(\hat{s})}(\alpha)$ of a
vector source $\mathcal{X}(\Omega, F)$ and estimator $\hat{s}$ can be convexified using
"rate-sharing" sampling matrices. This result both strengthens
our previous bounds and proves that $\rho(\alpha)$ is a convex function.

Before we discuss this strategy, we need the following
useful result which shows that the operational sampling rate
distortion function $\rho^{(\hat{s})}(\alpha)$ of an estimator $\hat{s}$ does not change
if the estimator only has approximate information about the
sparsity $k_n$. The proof is given in Appendix C.

Theorem 3. Let $\rho^{(\hat{s})}(\alpha)$ be the operational sampling rate
distortion function of a vector source $\mathcal{X}(\Omega, F)$ and estimator
$\hat{s}$. Let $\tilde{\rho}^{(\hat{s})}(\alpha)$ be the corresponding operational sampling rate
distortion function when the estimator $\hat{s}$ uses some sparsity
sequence $\tilde{k}_n$ instead of the true sparsity sequence $k_n$. If
$\lim_{n \to \infty} |k_n - \tilde{k}_n| / n = 0$ and $\limsup_{n \to \infty} \tilde{k}_n - k_n \le 0$,
then $\tilde{\rho}^{(\hat{s})}(\alpha) = \rho^{(\hat{s})}(\alpha)$.

One consequence of Theorem 3 is that our bounds on $\rho(\alpha)$
apply to the setting where the number of elements is a random
variable that concentrates around the expected value $\Omega n$. This
occurs, for example, if the elements of x are i.i.d. random
variables with distribution function $F_X$ given by

$$F_X(x) = (1 - \Omega)\, \mathbf{1}(x \ge 0) + \Omega F(x).$$

The fact that the sparsity $k_n$ does not need to be known
exactly is also an important part of our convexification strategy
which is described in the following result. The proof is given
in Appendix D.

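Theorem 4. Let $(\rho_1, \alpha_1)$ and $(\rho_2, \alpha_2)$ be two achievable
sampling rate distortion pairs for a vector source $\mathcal{X}(\Omega, F)$
and let $\{A_1^{(n)} \in \mathbb{R}^{\lceil \rho_1 n \rceil \times n}\}$ and $\{A_2^{(n)} \in \mathbb{R}^{\lceil \rho_2 n \rceil \times n}\}$ be
the sampling matrix sequences that achieve these rates. For
any $\lambda \in (0, 1)$, let $\{A^{(n)}\}$ be a rate-sharing sampling matrix
sequence defined by

$$A^{(n)} = \begin{bmatrix} A_1^{(\lceil \lambda n \rceil)} & 0 \\ 0 & A_2^{(n - \lceil \lambda n \rceil)} \end{bmatrix} P^{(n)} \tag{13}$$

where $P^{(n)}$ is distributed uniformly over the set of $n \times n$
permutation matrices. Then, the sampling rate distortion pair
$(\lambda\rho_1 + (1-\lambda)\rho_2,\ \lambda\alpha_1 + (1-\lambda)\alpha_2)$ is achievable using the
sequence $\{A^{(n)}\}$.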

Two consequences of Theorem 4 are the following.

Corollary 3. For any estimator $\hat{s}$, the operational sampling
rate distortion function $\rho^{\mathrm{op}}(\alpha)$ of the vector source $\mathcal{X}(\Omega, F)$
is convex.

Corollary 4. The sampling rate distortion function $\rho(\alpha)$ of
the vector source $\mathcal{X}(\Omega, F)$ is convex.

Additionally, since $\rho(\alpha) = 0$ for any distortion $\alpha \ge 1 - \Omega$,
we may conclude that

$$\rho(\lambda\alpha + (1-\lambda)(1-\Omega)) \le \lambda\,\rho(\alpha) \tag{14}$$

for all $0 \le \lambda \le 1$.

From a practical standpoint, the rate-sharing strategy used
in Theorem 4 shows that, in the high-dimensional setting, our
sampling problem can be split into two separate subproblems.
Since the splitting is done randomly, the properties of each
subproblem, such as the empirical distribution of $x$, can be
accurately characterized in terms of the original problem.
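
To make the splitting concrete, the following is a minimal sketch of how a rate-sharing matrix of the kind described in Theorem 4 might be assembled, assuming (as we read equation (13)) a block-diagonal arrangement of two sub-matrices followed by a uniformly random column permutation. The function name `rate_sharing_matrix` and the Gaussian placeholder blocks are illustrative assumptions, not part of the paper.

```python
import math
import numpy as np


def rate_sharing_matrix(n, rho1, rho2, lam, rng):
    """Build [[A1, 0], [0, A2]] @ P for a uniformly random permutation P.

    Returns the matrix together with n1 (columns in block 1) and
    m1 (rows in block 1) so the block structure can be inspected.
    """
    n1 = math.ceil(lam * n)        # columns handled by the first subproblem
    n2 = n - n1                    # columns handled by the second subproblem
    m1 = math.ceil(rho1 * n1)      # samples spent on the first subproblem
    m2 = math.ceil(rho2 * n2)      # samples spent on the second subproblem

    # Placeholder blocks (assumption): i.i.d. Gaussian entries with
    # variance 1 / (block width). Any rate-achieving matrices could be used.
    a1 = rng.standard_normal((m1, n1)) / math.sqrt(n1)
    a2 = rng.standard_normal((m2, n2)) / math.sqrt(n2)

    block = np.zeros((m1 + m2, n))
    block[:m1, :n1] = a1
    block[m1:, n1:] = a2

    # Uniformly random n x n permutation matrix.
    perm = np.eye(n)[rng.permutation(n)]
    return block @ perm, n1, m1


rng = np.random.default_rng(0)
A, n1, m1 = rate_sharing_matrix(n=40, rho1=0.5, rho2=0.25, lam=0.3, rng=rng)
print(A.shape, "overall rate:", A.shape[0] / A.shape[1])
```

Each column of the result is sampled by only one of the two blocks, which is what allows the two subproblems to be analyzed separately; the overall sampling rate is approximately $\lambda\rho_1 + (1-\lambda)\rho_2$.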

Lastly, we remark upon a key difference between the i.i.d.
Gaussian matrices used in Theorems 1 and 2 and the rate-sharing
matrices described in Theorem 4 with respect to
recovering vectors that are sparse in a basis other than the
standard basis. In particular, suppose that $x$ is not actually
sparse, but instead has some sparse representation $\tilde{x}$ with
respect to an orthonormal matrix $B \in \mathbb{R}^{n \times n}$, that is $x = B\tilde{x}$.
If both the designer of the sampling matrix $A$ and the estimator
$\hat{s}$ know $B$, then any matrix $A$ designed for the original
setting (where $B = I_n$) can be realized in the generalized
setting by using the modified sampling matrix $AB^{-1}$. What
happens, however, if only the estimator knows $B$? Since any
i.i.d. Gaussian matrix $A$ is equal in distribution to $AB$, the
bounds given in Theorems 1 and 2 still apply. By contrast,
the rate-sharing matrix in Theorem 4 depends critically on the
knowledge of $B$ and cannot be used if $A$ must be designed
independently of B. For a further discussion of this issue, see 



VI. Scaling Behavior 

In this section, we show how the bounds on the sampling
rate distortion function $\rho(\alpha)$ given in Theorems 1 and 2
depend on the distortion $\alpha$ and various properties of the source
$\mathcal{X}(\Omega, F)$ such as the sparsity rate $\Omega$ or the power $P(\Omega, F)$.
By comparing simplified versions of the upper bounds in this
paper with the lower bounds from the companion paper [20],
we are able to address questions such as how $\rho(\alpha)$ increases as
$\alpha$ becomes small and how $\rho(\alpha)$ converges to the noiseless rate
$\rho_0(\alpha)$ as the SNR becomes large.

One key property of the source is the power $P(\Omega, F)$. To
describe scalings of the power we use $\mathcal{X}(\Omega, F; P)$ to denote a
source characterized by a distribution $F$ that is scaled to have
power $P$. Another key property of the source is the following.

Definition 8. The decay rate $L \in [0, \infty]$ of a distribution
function $F$ is defined as

$$L := \lim_{\epsilon \to 0} \frac{\log \epsilon}{\log\left(F(\epsilon) - F(-\epsilon)\right)} \tag{15}$$

if the limit exists.



The decay rate $L$ characterizes the relative size of the smallest
nonzero elements drawn from a source $\mathcal{X}(\Omega, F)$. For instance,
if $X$ is a random variable with decay rate $L < \infty$, and we
define

$$x_\epsilon = \inf\{x > 0 : \Pr\{|X| < x\} > \epsilon\},$$

then $\epsilon^{-L} \cdot x_\epsilon \to c$ as $\epsilon \to 0$ for some $c \in (0, \infty)$. The decay
rate is $L = 0$ if $X$ is bounded away from zero and $L = \infty$ if
and only if $\Pr\{X = 0\} > 0$. We denote by $\mathcal{F}_0$ the set of all
distributions $F$ with finite second moment and zero probability
mass at zero, and note that $L$ is finite for any $F \in \mathcal{F}_0$.
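
As a small numerical illustration of Definition 8, the sketch below evaluates the ratio $\log \epsilon / \log(F(\epsilon) - F(-\epsilon))$ at decreasing values of $\epsilon$ for three distributions. The helper names and the choice of example distributions are ours, introduced only for illustration; the ratio approaches the decay rate only in the limit, and for the Gaussian case the convergence is slow.

```python
import math


def decay_ratio(window_mass, eps):
    """Evaluate log(eps) / log(F(eps) - F(-eps)) for a given window mass."""
    return math.log(eps) / math.log(window_mass(eps))


def gaussian_mass(eps):
    # Standard normal: F(eps) - F(-eps) = erf(eps / sqrt(2)).
    return math.erf(eps / math.sqrt(2.0))


def uniform_mass(eps):
    # Uniform on [-1, 1]: F(eps) - F(-eps) = eps for eps <= 1.
    return eps


def linear_density_mass(eps):
    # Density |x| on [-1, 1]: F(eps) - F(-eps) = eps**2 for eps <= 1.
    return eps * eps


for name, mass in [("gaussian", gaussian_mass),
                   ("uniform", uniform_mass),
                   ("density |x|", linear_density_mass)]:
    ratios = [decay_ratio(mass, 10.0 ** -p) for p in (2, 4, 8)]
    print(name, [round(r, 4) for r in ratios])
```

The Gaussian and uniform examples both tend to $L = 1$, while the density proportional to $|x|$, which places less mass near zero, gives $L = 1/2$.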

One useful property of the decay rate is that it can be used to
bound the relative power of the $\beta$-truncated distribution given
in Definition 6.



Lemma 1 ([20]). Given any distribution function $F \in \mathcal{F}_0$
with decay rate $L$, there exist constants $0 < C_F^- \le C_F^+ < \infty$
such that

(16)

for any $0 < \beta < 1$.



Using the above properties, we are able to provide the
following simplified versions of Theorems 1 and 2.

Proposition 1. Given any distribution $F \in \mathcal{F}_0$, there exists
a constant $C_F < \infty$ such that the operational sampling rate
distortion function $\rho^{\mathrm{NS}}(\alpha)$ of the vector source $\mathcal{X}(\Omega, F; P)$
corresponding to the nearest subspace estimator is upper
bounded by



P (a) <M + C F ■ max 



fSSl' 



(17) 



/3e{a,l} log(l + /3 4L+2 P 2 ) 

for all distortions $\alpha \in (0, 1/4)$ where $F$ has decay rate $L$.

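Proposition 2. Given any distribution $F \in \mathcal{F}_0$, there exists
a constant $C_F < \infty$ such that the operational sampling rate
distortion function $\rho^{\mathrm{TH}}(\alpha)$ of the vector source $\mathcal{X}(\Omega, F; P)$
corresponding to the thresholding estimator is upper bounded
by

$$\rho^{\mathrm{TH}}(\alpha) \le C_F \left(\frac{1+P}{P}\right) \left(\frac{1}{\alpha}\right)^{2L} \log\left(\frac{1}{\alpha}\right) \tag{18}$$

for all distortions $\alpha \in (0, 1/4)$ where $F$ has decay rate $L$.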

To understand the significance of Propositions 1 and 2 it is
useful to consider the lower bounds from [20].



Proposition 3 ([20]). Given any distribution $F \in \mathcal{F}_0$, there
exists a constant $C_F > 0$ such that the sampling rate
distortion function $\rho(\alpha)$ of the vector source $\mathcal{X}(\Omega, F; P)$ is
lower bounded by



p(a) >Cf 



(19) 



log (1 + a 2L + l P) 

for all distortions $\alpha \in (0, 1/4)$ where $F$ has decay rate $L$.

Combining Propositions 2 and 3 provides the following tight
characterization of the scaling of $\rho(\alpha)$ with respect to $\alpha$. (This
result corresponds to Proposition 8 in [20].)



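Proposition 4. Given any distribution $F \in \mathcal{F}_0$ and sparsity
rate $\Omega$, there exist constants $0 < C_{F,\Omega}^- \le C_{F,\Omega}^+ < \infty$ such
that the sampling rate distortion function $\rho(\alpha)$ of the vector
source $\mathcal{X}(\Omega, F)$ obeys

$$C_{F,\Omega}^- \left(\frac{1}{\alpha}\right)^{2L} \log\left(\frac{1}{\alpha}\right) \le \rho(\alpha) \le C_{F,\Omega}^+ \left(\frac{1}{\alpha}\right)^{2L} \log\left(\frac{1}{\alpha}\right) \tag{20}$$

for all distortions $\alpha \in (0, 1/4)$ where $F$ has decay rate $L$.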

Furthermore, combining Propositions 1, 2 and 3 gives the
following characterization of the scaling of $\rho(\alpha)$ with respect
to the SNR.

Proposition 5. Given any distribution $F \in \mathcal{F}_0$, sparsity rate
$\Omega$, and distortion $\alpha \in (0, 1/4)$, there exist constants
$0 < C_{F,\Omega,\alpha}^- \le C_{F,\Omega,\alpha}^+ < \infty$ such that the sampling rate distortion
function $\rho(\alpha)$ of the vector source $\mathcal{X}(\Omega, F; P)$ obeys

$$\frac{C_{F,\Omega,\alpha}^-}{\log(1+P)} \le \rho(\alpha) \le \Omega + \frac{C_{F,\Omega,\alpha}^+}{\log(1+P)}. \tag{21}$$

Proposition 5 shows that the tradeoff between the sampling
rate $\rho$ and the SNR exhibits two different behaviors: at low
SNR, the sampling rate is inversely proportional to the SNR
and at high SNR, the sampling rate is inversely proportional
to the logarithm of the SNR.

The constant gap between the bounds in Proposition 5
corresponds, roughly speaking, to the use of trivial bounds
$0 \le \rho_0(\alpha) \le \Omega$ on the noiseless sampling rate distortion
function $\rho_0(\alpha)$. In some cases (see for example Proposition
2 in [20]) the lower bound is tight. However, in the special
case where the sampling matrix is constrained to have i.i.d.
elements, we may use stronger bounds from [20] to show that
the upper bound is tight. To state this result, we need the
following function which measures the relative entropy power
of the distribution $F$. Given any sparsity rate $\Omega$ and distribution
$F$ with mean $\mu_F$, variance $\sigma_F^2$, and differential entropy $h(F)$,
we define the function $\theta(\Omega, F) \in [0, 1]$ to be

(2 7 re)- 1 exp(2/i(F)) 



0(Sl,F) 



(22) 



Proposition 6 ([20]). If the sampling matrix is i.i.d., then
there exists a constant $C > 0$ such that the sampling rate
distortion function $\rho(\alpha)$ of the vector source $\mathcal{X}(\Omega, F; P)$ is
lower bounded by

Olog(i) 



p(a)>Q + C- 

log(l + P) 

for all distortions a G (0, 1/4) that satisfy 



(23) 



0(fi, F) > cxp (l - ^R(Q, a)j (24) 
where R{Q,a) = H(Q) - QH{a) - (1-Q)H{{^). 

VII. Examples and Illustrations 

This section provides specific examples and illustrations of
the upper bounds on the sampling rate distortion function $\rho(\alpha)$
given in Theorems 1, 2 and 4.

A. Bounds for a Gaussian Source 

To begin, we consider the setting where $F$ is the distribution
of a zero-mean Gaussian random variable with variance $\sigma^2$.
It is easy to see that the power of the source $\mathcal{X}(\Omega, F)$ is
given by $P(\Omega, F) = \Omega\sigma^2$, and it can be shown (see Appendix
D of [20]) that the power corresponding to the $\beta$-truncated
distribution $F_\beta$ is given by

$$P(\Omega, F_\beta) = \left[1 - (t_\beta/\beta)(2/\pi)^{1/2}\exp(-t_\beta^2/2)\right]\Omega\sigma^2\beta \tag{25}$$

where $t_\beta = Q^{-1}\left(\frac{1-\beta}{2}\right)$ and $Q^{-1}$ denotes the functional
inverse of $Q(x) = \int_x^\infty (2\pi)^{-1/2}\exp(-u^2/2)\,du$.

Since the Gaussian distribution has decay rate $L = 1$, we
know that the power $P(\Omega, F_\beta)$ scales like $\beta^3$ for small $\beta$.
Applying a Taylor expansion to (25) gives the more
precise characterization

$$\lim_{\beta \to 0} \frac{1}{\beta^3} P(\Omega, F_\beta) = \frac{\pi\,\Omega\sigma^2}{6}.$$
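
The Gaussian truncated power can be checked numerically. The sketch below assumes, as our reading of the definitions used in Appendix A, that $P(\Omega, F_\beta)$ is $\Omega$ times the second-moment contribution of the smallest-magnitude $\beta$ fraction of the nonzero values, so that it is not normalized by $\beta$. Under that assumption it compares a closed form with direct numerical integration and examines the small-$\beta$ behavior. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def truncated_power(beta, omega=1e-4, sigma2=1.0):
    """Closed form for the Gaussian source (unnormalized reading)."""
    # t_beta = Q^{-1}((1 - beta) / 2), i.e. Pr{|Z| < t_beta} = beta.
    t = NormalDist().inv_cdf((1.0 + beta) / 2.0)
    bracket = 1.0 - (t / beta) * math.sqrt(2.0 / math.pi) * math.exp(-t * t / 2.0)
    return bracket * omega * sigma2 * beta


def truncated_power_numeric(beta, omega=1e-4, sigma2=1.0, steps=20000):
    """Midpoint-rule evaluation of omega * sigma2 * E[Z^2 ; |Z| < t_beta]."""
    t = NormalDist().inv_cdf((1.0 + beta) / 2.0)
    h = t / steps
    total = 0.0
    for j in range(steps):
        z = (j + 0.5) * h
        total += z * z * math.exp(-z * z / 2.0)
    second_moment = 2.0 * total * h / math.sqrt(2.0 * math.pi)
    return omega * sigma2 * second_moment


beta = 0.01
ratio = truncated_power(beta) / (beta ** 3 * 1e-4)
print("P / (beta^3 * Omega * sigma^2):", ratio, "pi / 6:", math.pi / 6.0)
```

For small $\beta$ the printed ratio should be close to $\pi/6$, which is the cubic scaling expected for a source with decay rate $L = 1$ under this reading.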

To illustrate our bounds on the sampling rate distortion
function of this source, we first consider the special case where
the sampling matrix is required to have i.i.d. elements. In
Figure 2(a)-(b), the upper bounds on $\rho(\alpha)$ given in Theorems
1 and 2 are shown as a function of $\alpha$ for two different SNRs.
Since these bounds correspond to an i.i.d. Gaussian sampling
matrix, they represent valid upper bounds for i.i.d. sampling
matrices. Also shown is a lower bound (see [20, Theorem 4])
which applies universally to any estimator and i.i.d. sampling
matrix.

In the low SNR setting (Figure 2(a)) the best bound
corresponds to the thresholding estimator, and in the high SNR
setting (Figure 2(b)) the best bound corresponds to the nearest
subspace estimator. Since our analysis of the thresholding
estimator is exact (at least for the case of a Gaussian
sampling matrix), these results indicate that the nearest subspace
estimator is significantly better than the thresholding estimator
at high SNR. However, since the upper bound on the nearest
subspace estimator is not necessarily tight, it is not possible
to say which estimator is better at low SNR. It is interesting
to observe that in both cases, the bounds show that $\rho(\alpha)$ is
relatively flat for nonzero distortions $\alpha$, but tends to infinity
rapidly as $\alpha \to 0$. This behavior shows why it is important to
consider approximate recovery of the sparsity pattern.

Next, we address what happens if there are no constraints
on the sampling matrix (other than the normalization imposed
by (0)). In Figure 3(a)-(b), convex versions of the upper bounds
on $\rho(\alpha)$ given in Theorems 1 and 2 corresponding to the
rate-sharing strategy given in Theorem 4 are shown as a function
of $\alpha$ for two different SNRs. Also shown is a lower bound (see
[20, Theorem 3]) which applies universally to any estimator
and sampling matrix.

The basic behavior of the bounds in Figure 3 is similar to
that of the bounds for i.i.d. sampling matrices in Figure 2.
For the upper bounds, the main difference between the two
settings occurs at relatively large distortions. For the lower
bounds, the difference is most prominent at high SNR. It is
particularly interesting to note that for large distortions, the
high SNR nearest subspace bound in Figure 3(b) is less than
the corresponding lower bound in Figure 2(b). This behavior
addresses the question as to whether or not i.i.d. sampling
matrices are optimal in the asymptotic setting and shows that
in certain cases such matrices are suboptimal.

B. Bounds for a General Source 

We now show how our upper bounds, which are stated in
terms of a single distribution function $F$, can be extended to
more general sources characterized by a set of distributions $\mathcal{F}$.
In particular, we consider the source $\mathcal{X}(\Omega, \mathcal{F}(\eta, \gamma))$ where,
for any parameters $\eta \in [0, 1]$ and $\gamma \in (0, \infty)$, we define
$\mathcal{F}(\eta, \gamma) \subset \mathcal{F}_0$ to be the set of all distributions with power $\gamma$
and a lower bound $\sqrt{\eta\gamma}$ on the magnitude of any realization.
This source, which we refer to as the bounded source, assumes
Fig. 2. Bounds on the sampling rate distortion function $\rho(\alpha)$ for sampling matrices with i.i.d. elements. The upper bounds correspond to Theorems 1 and 2. The lower bound corresponds to Theorem 4 in [20]. Plots (a)-(b) correspond to a zero-mean Gaussian source. Plots (c)-(d) correspond to the general bounded source $\mathcal{X}(\Omega, \mathcal{F}(\eta, \gamma))$ with $\eta = 0.2$. In all cases $\Omega = 10^{-4}$. Panels: (a) Gaussian source, SNR = 0 dB; (b) Gaussian source, SNR = 40 dB; (c) bounded source, SNR = 0 dB; (d) bounded source, SNR = 40 dB.

Fig. 3. Bounds on the sampling rate distortion function $\rho(\alpha)$ for optimal sampling matrices. The upper bounds correspond to the convex versions of Theorems 1 and 2 that are attained using the rate-sharing sampling matrices described in Theorem 4. The lower bound corresponds to Theorem 3 in [20]. Plots (a)-(b) correspond to a zero-mean Gaussian source. Plots (c)-(d) correspond to the general bounded source $\mathcal{X}(\Omega, \mathcal{F}(\eta, \gamma))$ with $\eta = 0.2$. In all cases $\Omega = 10^{-4}$. Panels: (a) Gaussian source, SNR = 0 dB; (b) Gaussian source, SNR = 40 dB; (c) bounded source, SNR = 0 dB; (d) bounded source, SNR = 40 dB.

very little about the actual nonzero values and corresponds to
assumptions commonly used in the literature for exact recovery

ID, (a, HD-Eo). 

To apply the bounds given in Theorems 1 and 2 we use
the simple fact that, for any set of distributions $\mathcal{F}$ and fixed
estimator $\hat{s}$, the operational sampling rate distortion function of
the source $\mathcal{X}(\Omega, \mathcal{F})$ is equal to the supremum over $F \in \mathcal{F}$ of
the operational sampling rate distortion function of the source $\mathcal{X}(\Omega, F)$.
In other words, we may uniformly bound the source $\mathcal{X}(\Omega, \mathcal{F})$
by considering the worst case distribution $F \in \mathcal{F}$. We remark
that this useful property, which is true for any fixed estimator
$\hat{s}$, is not necessarily true for the fundamental sampling rate
distortion function since the optimal estimator for a source
$\mathcal{X}(\Omega, F)$ depends on the distribution $F$.

To bound the operational sampling rate distortion function
corresponding to the nearest subspace estimator, we observe
that for any distribution $F$, a lower bound on the power
$P(\Omega, F_\beta)$ of the $\beta$-truncated distribution $F_\beta$ provides an upper
bound on the right hand side of the bound in Theorem 1. Using the fact that
$\inf_{F \in \mathcal{F}(\eta, \gamma)} P(\Omega, F_\beta) = \beta\Omega\eta\gamma$ gives the following upper
bound
ns-ub, s Q+ ng(^) + (i-n)g(^) 

(a) = 1 I + max -, 7 . (26) 

a <p<i 



i2(l + /5077-y) 

To bound the operational sampling rate distortion function 
corresponding to the thresholding estimator, we use Corollary 
|2] and the fact that 

1 + P{Q,F) l + fi 7 



sup 



to obtain the upper bound 

1 + ^7 



777 f2 



(a) < n- 



777^ 



(27) 



In Figure 2(c)-(d), the upper bounds (26) and (27) are shown
as a function of the distortion $\alpha$ for two different SNRs.
Also shown is a lower bound designed specifically for the
bounded source $\mathcal{X}(\Omega, \mathcal{F}(\eta, \gamma))$ and i.i.d. sampling matrices
(see [20, Section VTC]). In Figure 1 the same bounds are
shown as a function of the SNR for fixed $\alpha$. In Figure 3(c)-(d),
convex versions of the upper bounds (26) and (27)
corresponding to the rate-sharing strategy given in Theorem 4
are shown as a function of $\alpha$ for various SNRs. Also shown is
a lower bound (see [20, Theorem 3]) which applies universally
to any estimator and sampling matrix.

Our bounds for the bounded source have many similarities
with the bounds for the Gaussian source. In both cases, the
nearest subspace estimator bound is stronger at high SNR
and the thresholding bound is stronger at low SNR. Also,
in both cases, the bounds are relatively flat for a range of $\alpha$,
but increase rapidly as $\alpha$ tends to zero. The main difference
between the bounded source and the Gaussian source, however, is the
point at which this change in behavior occurs. For example, in
Figure 2(a)-(b) the bounds increase rapidly for all $\alpha$ less than
0.05. In the corresponding plots for the bounded source, Figure
2(c)-(d), this same behavior occurs, but only for $\alpha$ much less
than 0.01.

The reason that small distortions are easier to obtain for
the bounded source than for the Gaussian source is that the
difficulty of recovery for $\alpha \approx 0$ is dominated by the size of
the smallest nonzero values. For the bounded source, these
values are bounded away from zero whereas for the Gaussian
source the values may be arbitrarily small. This property of a
source is precisely what the decay rate $L$ given in Definition
8 is designed to capture. For example, using the fact that
the Gaussian distribution has decay rate $L = 1$ and every
distribution $F \in \mathcal{F}(\eta, \gamma)$ has decay rate $L = 0$, the precise
rate at which $\rho(\alpha)$ increases as $\alpha \to 0$ can be determined
using Proposition 4.
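
As a rough numerical illustration of this comparison, the sketch below evaluates the scaling term $(1/\alpha)^{2L}\log(1/\alpha)$ that we read off from Proposition 4, ignoring the unknown constants $C_{F,\Omega}^{\pm}$. The function name `scaling_term` is ours, and the values indicate orders of growth only, not actual sampling rates.

```python
import math


def scaling_term(alpha, decay_rate):
    """(1/alpha)^(2L) * log(1/alpha), the alpha-dependence in Proposition 4."""
    return (1.0 / alpha) ** (2 * decay_rate) * math.log(1.0 / alpha)


for alpha in (0.1, 0.01, 0.001):
    bounded = scaling_term(alpha, decay_rate=0)    # bounded source, L = 0
    gaussian = scaling_term(alpha, decay_rate=1)   # Gaussian source, L = 1
    print(f"alpha={alpha}: L=0 -> {bounded:.3g}, L=1 -> {gaussian:.3g}")
```

With $L = 0$ the term grows only logarithmically as $\alpha \to 0$, whereas with $L = 1$ it grows by an additional factor of $1/\alpha^2$, which is consistent with the sharper rise seen for the Gaussian source.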

VIII. Connections with Optimal Estimation 

In this last section, we study the optimal estimator for the
vector source $\mathcal{X}(\Omega, F)$ characterized by a zero-mean Gaussian
distribution $F$, and show that certain limiting versions of this
estimator correspond to the nearest subspace and thresholding
estimators studied in Sections III and IV.

Our first result describes the optimal estimator associated
with our distortion metric (2). The proof, which follows
directly from the Bayesian formulation of the problem, is left
as an exercise.

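Proposition 7. Let $X \in \mathbb{R}^n$ be a random vector whose
sparsity pattern $S$ is distributed uniformly over $\mathcal{S}_k^n$ and whose
nonzero elements $\{X_i : i \in S\}$ are standard Gaussian random
variables, and let

$$Y = \sqrt{\gamma} \cdot AX + \sqrt{1-\gamma} \cdot W$$

for some $\gamma \in (0, 1)$ where $A \in \mathbb{R}^{m \times n}$ is a known sampling
matrix and $W$ is a standard Gaussian random vector. For
any distortion $\alpha \in [0, 1]$, the sparsity pattern estimate $\hat{S}$ that
minimizes the error probability $\Pr\{d(S, \hat{S}) > \alpha\}$ corresponds
to the estimator

$$\hat{s}^{\mathrm{OPT}}_{\gamma,\alpha}(y, A, k) \in \arg\max_{s \in \mathcal{S}_k^n} \sum_{s' \in B_\alpha(s)} \exp\{-\psi(s')\} \tag{28}$$

where $B_\alpha(s) = \{s' \in \mathcal{S}_k^n : d(s, s') \le \alpha\}$ and

$$\psi(s) = \tfrac{1}{2}\|\Sigma_s^{-1/2} y\|^2 + \tfrac{1}{2}\log|\Sigma_s| \quad \text{with} \quad \Sigma_s = \gamma A_s A_s^T + (1-\gamma) I_m.$$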
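
For very small problems, the estimator in Proposition 7 can be evaluated by brute force. The sketch below is one possible implementation under stated assumptions: $\psi(s)$ is taken to be the Gaussian negative log-likelihood (up to an additive constant) with covariance $\gamma A_s A_s^T + (1-\gamma)I$, and the distortion between two supports of size $k$ is assumed to be $|s \setminus s'|/k$, which is our reading of the metric. The function names are hypothetical.

```python
import itertools
import numpy as np


def psi(y, A, s, gamma):
    """Negative log-likelihood of y given support s, up to a constant."""
    a_s = A[:, list(s)]
    sigma = gamma * a_s @ a_s.T + (1.0 - gamma) * np.eye(A.shape[0])
    _, logdet = np.linalg.slogdet(sigma)
    return 0.5 * y @ np.linalg.solve(sigma, y) + 0.5 * logdet


def optimal_estimate(y, A, k, gamma, alpha):
    """Maximize the summed likelihood over the distortion ball around s."""
    supports = list(itertools.combinations(range(A.shape[1]), k))
    costs = np.array([psi(y, A, s, gamma) for s in supports])
    weights = np.exp(-(costs - costs.min()))   # rescaled for numerical stability

    def distortion(s, t):
        # Assumed metric: fraction of indices of s that are missing from t.
        return len(set(s) - set(t)) / k

    scores = [
        sum(w for t, w in zip(supports, weights) if distortion(s, t) <= alpha)
        for s in supports
    ]
    return supports[int(np.argmax(scores))]


A = np.eye(4)
y = np.array([0.0, 0.0, 3.0, 0.0])
estimate = optimal_estimate(y, A, k=1, gamma=0.5, alpha=0.0)
print(estimate)
```

The enumeration over all $\binom{n}{k}$ supports makes this practical only for toy sizes, which is one reason the simpler estimators studied in this paper are of interest.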



Unlike the nearest subspace and thresholding estimators, the
optimal Gaussian estimator (28) depends on the SNR
parameter $\gamma$ as well as the distortion bound $\alpha$. One consequence of
this dependence is that the optimal estimate is not invariant to
scalings of the samples $y$. Another consequence is that it is
not possible in general for a single sparsity pattern estimate $\hat{s}$
to be uniformly optimal for all distortions $\alpha \in [0, 1)$.

To determine whether or not the optimal Gaussian estimator
is significantly better than estimators that do not depend on
the parameters $\gamma$ or $\alpha$, it is useful to consider the bounds on
the operational sampling rate distortion functions illustrated in
Figures 2 and 3. Since the lower bounds in these figures apply
universally to any estimator, including the optimal Gaussian
estimator, the relative tightness of the upper and lower bounds
shows that near-optimal performance can be attained using
the nearest subspace estimator in the high SNR setting and
the thresholding estimator in the low SNR setting.
One explanation for this behavior is provided by the
following non-asymptotic result which shows that the optimal
Gaussian estimator converges pointwise to the nearest
subspace estimator as $\gamma \to 1$ and to the thresholding estimator as
$\gamma \to 0$. The proof of this result is rather long and can be
found in [43].

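Theorem 5. Let $y \in \mathbb{R}^m$, $A \in \mathbb{R}^{m \times n}$ and $k \in \{1, 2, \ldots, n\}$
be fixed inputs. Let $\alpha \in [0, 1)$ be a fixed distortion.

1) If the nearest subspace estimator is unique, then

$$\lim_{\gamma \to 1} \hat{s}^{\mathrm{OPT}}_{\gamma,\alpha}(y, A, k) = \hat{s}^{\mathrm{NS}}(y, A, k). \tag{29}$$

2) If the thresholding estimator is unique and the columns
of $A$ have equal magnitude, then

$$\lim_{\gamma \to 0} \hat{s}^{\mathrm{OPT}}_{\gamma,\alpha}(y, A, k) = \hat{s}^{\mathrm{TH}}(y, A, k). \tag{30}$$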

We first address the high SNR convergence (29) shown in
Theorem 5. For the special case of exact recovery ($\alpha = 0$), this
result is not particularly surprising, in part because the nearest
subspace estimator corresponds to the maximum likelihood
estimate of the vector $x$. What is less obvious, and in fact
significantly more difficult to prove, is that this convergence
occurs universally for all distortion bounds $\alpha$. This result
is interesting because it explains why the nearest subspace
algorithm performs so well in the high SNR setting for a large
range of distortions. Nevertheless, it is important to keep in
mind that the rate at which the probability of error tends to
zero still depends on the distortion bound $\alpha$ (see, for instance,
the example given in [43, Proposition 2]).

Next, we address the low SNR convergence (30) shown in
Theorem 5. In this case, the log-likelihood $-\psi(s)$
becomes (up to terms that do not depend on $s$) proportional to
$\|A_s^T y\|^2$ as $\gamma \to 0$. Using the fact that
$\|A_s^T y\|^2$ can be expressed as a sum over terms indexed by the
indices in $s$ shows that the convergence occurs uniformly for
all $\alpha$. What is particularly interesting about this result is that
it shows that near-optimal estimation in the low SNR setting
can be achieved using a computationally efficient estimator.
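
The two limiting estimators can be sketched in a few lines. The code below assumes that the thresholding estimator keeps the $k$ indices with the largest correlations $|\langle A_i, y \rangle|$ and that the nearest subspace estimator picks the support whose column span leaves the smallest residual; both are our reading of the estimators analyzed in Appendices A and B, and the function names are hypothetical.

```python
import itertools
import numpy as np


def thresholding(y, A, k):
    """Keep the k columns most correlated with y (cost: one matrix product)."""
    corr = np.abs(A.T @ y)
    return tuple(sorted(np.argsort(corr)[-k:].tolist()))


def nearest_subspace(y, A, k):
    """Exhaustive search for the k-column subspace closest to y."""
    best, best_residual = None, np.inf
    for s in itertools.combinations(range(A.shape[1]), k):
        a_s = A[:, list(s)]
        coef, *_ = np.linalg.lstsq(a_s, y, rcond=None)
        residual = float(np.sum((y - a_s @ coef) ** 2))
        if residual < best_residual:
            best, best_residual = s, residual
    return best


A = np.eye(5)[:, :4]                          # four orthonormal columns in R^5
y = np.array([0.0, 2.0, 0.0, -3.0, 0.1])      # support {1, 3} plus a small residual
print(thresholding(y, A, 2), nearest_subspace(y, A, 2))
```

The thresholding rule needs only the $n$ correlations, while the exhaustive search visits every support of size $k$, which illustrates the computational gap between the two estimators.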

Appendix A
Proof of Theorem 1

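Let $s^*$ denote the true sparsity pattern of $x$ and define the
set $B_\alpha := \{s \in \mathcal{S}_k^n : d(s^*, s) \le \alpha\}$. Observe that the event

$$\mathcal{E} := \left\{ \min_{s \in B_\alpha} \|\Pi(A_s)Y\|^2 < \min_{s \in B_\alpha^c} \|\Pi(A_s)Y\|^2 \right\}$$

guarantees that the distortion of the nearest subspace estimate
is less than or equal to $\alpha$. Moreover, for any threshold $t$, the
event $\mathcal{E}$ is implied by $\mathcal{E}_1(t) \cap \mathcal{E}_2(t)$ where

$$\mathcal{E}_1(t) = \left\{ \min_{s \in B_\alpha} \|\Pi(A_s)Y\|^2 < (m-k) \cdot t \right\},$$
$$\mathcal{E}_2(t) = \left\{ \min_{s \in B_\alpha^c} \|\Pi(A_s)Y\|^2 > (m-k) \cdot t \right\},$$

and thus the probability of error can be upper bounded as

$$P_e^{(n)} \le \Pr\{\mathcal{E}^c\} \le \Pr\{\mathcal{E}_1^c(t)\} + \Pr\{\mathcal{E}_2^c(t)\}.$$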



In the following, we show that there exists a threshold $t$
such that both $\Pr\{\mathcal{E}_1^c(t)\}$ and $\Pr\{\mathcal{E}_2^c(t)\}$ decay exponentially
rapidly with the problem size $n$ when the elements of the
sampling matrix $A$ are i.i.d. zero-mean Gaussian random
variables.

We begin with the following result which characterizes the
marginal distribution of each projection $\|\Pi(A_s)Y\|^2$. For
two sets $s$ and $u$ we use $s \setminus u$ to denote the difference set
$\{i \in s : i \notin u\}$. Additionally, we use $\chi_d^2$ to denote a
chi-square random variable with $d$ degrees of freedom.

Lemma 2. For any sparsity pattern $s \in \mathcal{S}_k^n$, the random
variable

$$\left(1 + \tfrac{1}{n}\|x_{s^* \setminus s}\|^2\right)^{-1} \|\Pi(A_s)Y\|^2$$

has a chi-square distribution with $m - k$ degrees of freedom.

Proof: Since $\Pi(A_s)$ projects onto the null space of $A_s^T$,

$$\Pi(A_s)Y = \Pi(A_s)\left[A_{s^*} x_{s^*} + W\right] = \Pi(A_s)\left[A_{s^* \setminus s}\, x_{s^* \setminus s} + W\right]$$

where the vector $A_{s^* \setminus s}\, x_{s^* \setminus s} + W$ is independent of $\Pi(A_s)$
and has i.i.d. Gaussian elements with zero mean and variance
$1 + \tfrac{1}{n}\|x_{s^* \setminus s}\|^2$. Using the rotational invariance of the Gaussian
distribution, and the fact that the projection matrix $\Pi(A_s)$ has
rank $m - k$ almost surely concludes the proof. ■
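
Lemma 2 lends itself to a quick Monte Carlo sanity check. The sketch below assumes the sampling matrix has i.i.d. $N(0, 1/n)$ entries and unit-variance noise, which is the normalization suggested by the variance expression in the proof. It then compares the empirical mean and variance of the normalized residual with those of a chi-square variable with $m - k$ degrees of freedom. The function name and parameter values are illustrative.

```python
import numpy as np


def normalised_residual(rng, m, n, x, s_true, s):
    """One draw of ||Pi(A_s) Y||^2 divided by 1 + ||x on (s_true minus s)||^2 / n."""
    A = rng.standard_normal((m, n)) / np.sqrt(n)       # entries N(0, 1/n)
    y = A[:, s_true] @ x[s_true] + rng.standard_normal(m)
    q, _ = np.linalg.qr(A[:, s])                       # orthonormal basis of range(A_s)
    r = y - q @ (q.T @ y)                              # projection onto the complement
    missed = np.setdiff1d(s_true, s)                   # indices in s_true but not in s
    return (r @ r) / (1.0 + np.sum(x[missed] ** 2) / n)


rng = np.random.default_rng(0)
m, n = 20, 30
s_true = np.array([0, 1, 2])
s = np.array([1, 2, 3])
x = np.zeros(n)
x[s_true] = [2.0, -1.0, 3.0]

draws = np.array([normalised_residual(rng, m, n, x, s_true, s) for _ in range(4000)])
dof = m - len(s)
print("mean:", draws.mean(), "expected:", dof)
print("variance:", draws.var(), "expected:", 2 * dof)
```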
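We now consider the event $\mathcal{E}_1(t)$. Using Lemma 2 gives

$$\Pr\{\mathcal{E}_1^c(t)\} = \Pr\left\{ \min_{s \in B_\alpha} \|\Pi(A_s)Y\|^2 \ge (m-k) \cdot t \right\} \le \Pr\left\{ \|\Pi(A_{s^*})Y\|^2 \ge (m-k) \cdot t \right\} = \Pr\left\{ \chi^2_{m-k} \ge (m-k) \cdot t \right\}.$$

Applying the chi-square concentration bounds in Lemma 8 in
Appendix F shows that

$$\limsup_{n \to \infty} \frac{1}{n} \log \Pr\{\mathcal{E}_1^c(t)\} < 0 \quad \text{for any } t > 1. \tag{31}$$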

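Next we consider the event $\mathcal{E}_2(t)$. Using the union bound
and Lemma 2 gives

$$\Pr\{\mathcal{E}_2^c(t)\} = \Pr\left\{ \min_{s \in B_\alpha^c} \|\Pi(A_s)Y\|^2 \le (m-k) \cdot t \right\} \le |B_\alpha^c| \max_{s \in B_\alpha^c} \Pr\left\{ \|\Pi(A_s)Y\|^2 \le (m-k) \cdot t \right\}.$$

If we define the set $U_\beta := \{s : d(s^*, s) = \lceil \beta k \rceil / k\}$ and the
functional

$$P_\beta(x) := \min_{s \in U_\beta} \tfrac{1}{n}\|x_{s^* \setminus s}\|^2, \tag{32}$$

then we obtain the further bound

$$\Pr\{\mathcal{E}_2^c(t)\} \le k \max_{\beta \in [\alpha, 1]} |U_\beta| \max_{s \in U_\beta} \Pr\left\{ \chi^2_{m-k} \le \frac{(m-k) \cdot t}{1 + \tfrac{1}{n}\|x_{s^* \setminus s}\|^2} \right\} = k \max_{\beta \in [\alpha, 1]} |U_\beta| \Pr\left\{ \chi^2_{m-k} \le \frac{(m-k) \cdot t}{1 + P_\beta(x)} \right\}. \tag{33}$$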
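A simple counting argument gives

$$|U_\beta| = \binom{k}{\lceil \beta k \rceil} \binom{n-k}{\lceil \beta k \rceil}$$

and using Lemma 7 in Appendix F shows that

$$\lim_{n \to \infty} \frac{1}{n} \log |U_\beta| = \Omega H(\beta) + (1-\Omega) H\!\left(\frac{\beta\Omega}{1-\Omega}\right). \tag{34}$$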
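
The limit in (34) can be checked numerically. The sketch below assumes $|U_\beta|$ is the product of two binomial coefficients (choosing which $\lceil \beta k \rceil$ true indices are dropped and which incorrect indices replace them) and takes $H(\cdot)$ to be the binary entropy in nats. The function names are ours, and the example requires $\beta\Omega/(1-\Omega) \le 1$.

```python
import math


def binary_entropy(p):
    """Binary entropy in nats, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


def log_binom(a, b):
    """Natural log of the binomial coefficient C(a, b) via log-gamma."""
    return math.lgamma(a + 1) - math.lgamma(b + 1) - math.lgamma(a - b + 1)


def normalised_log_count(n, omega, beta):
    """(1/n) * log( C(k, j) * C(n - k, j) ) with k = omega * n, j = ceil(beta * k)."""
    k = round(omega * n)
    j = math.ceil(beta * k)
    return (log_binom(k, j) + log_binom(n - k, j)) / n


def entropy_limit(omega, beta):
    return omega * binary_entropy(beta) + (1.0 - omega) * binary_entropy(beta * omega / (1.0 - omega))


for n in (10**3, 10**4, 10**6):
    print(n, normalised_log_count(n, 0.1, 0.3), entropy_limit(0.1, 0.3))
```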

To lower bound P^(x) we use the following result. 

Lemma 3. For any sequence of vectors {x^™)} € X(Q,F) 
and e € (0, 1), 

liminfPfl(x (n) ) > P(0,Fs_ e ) 

uniformly for all (3 £ (e, 1) where P(-, •) anc/ i 7 ^ are defined 
by (O a«(i © respectively. 

Proof: For each n, let F n and F n denote the empirical 
distributions of {xi : i 6 s*} and {x 2 : i £ s*} respectively. 
Furthermore, let F~ 1 (p) = ini{x : F n (x) < p} denote the 
inverse of F n and observe that 



P^(x) = - ]T F- 1 (*/[/3fcl) > - • / F^fr)*. (35) 

Next, let X be a random variable with distribution F and let 
F be the distribution of X 2 . By definition, F n — > F uniformly 
for any sequence drawn from the vector source X(H,,F). We 
note that in some cases, this convergence is sufficient to show 
that F^ 1 F^ 1 uniformly. In general, however, flat sections 
in F may correspond to discontinuities in F^ 1 and thus more 
care is needed. 

Observe that the convergence F n F does guarantee that 
for n large enough, 



F- x (p-e), ifp>e 



(36) 



0, if p < e 

for all < p < 1 — e. Hence, combining ( f35T > and d36i l gives, 

liminfP^x^) > n / F- 1 (p-e)dp = P(0,Pg_ e ) 

which completes the proof. ■ 
Applying Lemma 3 and the chi-square large deviation
bounds from Lemma 8 in Appendix F shows that for any
$\epsilon > 0$ and $t < 1 + P(\Omega, F_{\beta-\epsilon})$,



limsupilogPr^Xm-& < 



<(p-CL)£ 



(m — k) ■ t 
m ~ k l + P /3 (x(™)) 

i + p(n,Fff_ 6 ) 



(37) 



Combining (33), (34) and (37) shows that the probability
of error decays exponentially rapidly with $n$ provided that the
sampling rate $\rho$ obeys

p > 11 + max -. ■ 

pe[a,i] jr ( 



for some $\epsilon > 0$ and threshold $1 < t < 1 + P(\Omega, F_{\beta-\epsilon})$. Taking
the infimum over all such rates completes the proof.



Appendix B
Proof of Theorem 2

For each integer n, denote by Z„ = [Z n ^,- 
random vector with elements given by 



) Z nyn ] the 



m(H 



(Ai,Y) 



and define the functions 



D n +(i) = ^l(4<(). 



With a bit of work it can be verified that the thresholding 
estimate is successful with respect to distortion a if and only 
if 



infmax{^D-(t),D+(t)}<a. 



(38) 



In the following, we identify the infimum over all sampling
rates $\rho$ such that (38) occurs with probability tending to one
as $n \to \infty$ when the elements of the sampling matrix $A$ are
i.i.d. zero-mean Gaussian random variables.

The key technical parts of the proof are given by the
following results which characterize the asymptotic convergence
of the empirical distributions $D_n^+(t)$ and $D_n^-(t)$. The proofs
are given in Sections B-A and B-B below.

Lemma 4. For any sequence {x(™)} £ X(Q,F) and e > 0, 



lim Pr{ sup LD-(t) - D+{t) \ > e \ = 

n->oo It J 

where D~(t) = 2Q(t). 

Lemma 5. For any sequence {x^} £ X(Q,F) and e > 0, 
lim Pr I sup I D+ (t) - D + (t-,n,F,p)\ > e i = 

n-s-oo [ t 1 J 

where D+(t; ft, F,p)=f R G( 1+ P P ^, F) , t)dF(u). 

Using Lemmas 4 and 5, the remainder of the proof is
relatively straightforward. Since $D^-(t)$ is continuous and
strictly decreasing, and since $D^+(t; \Omega, F, \rho)$ is continuous and
strictly increasing in $t$, and continuous and strictly decreasing
in $\rho$, the bound defined in Theorem 2 can be rewritten as



'(a) = inf {p : D + (t*;il, F, p) < a} where 
t* = inf{t : il^D-(t) < a}. 



The convergence given in Lemmas 4 and 5 shows that
condition (38) occurs with probability tending to one if
$\rho > \rho^{\mathrm{TH\text{-}UB}}(\alpha)$ and does not occur with probability tending
to one if $\rho < \rho^{\mathrm{TH\text{-}UB}}(\alpha)$, which concludes the proof.

A. Proof of Lemma 4

Conditioned on any realization Y 1 -™' = y^ n ', the variables 
{Z n ^ : i ^ s} are i.i.d. Gaussian with zero mean and variance 
<T 2 (yW,x(")) = (l + i||x (n) || 2 ) _1 ||y (n) || 2 . By the Glivenko-
Cantelli theorem, 



D-(t)^Pr\xl > 



t 



(r 2 (yW,xW) 

almost surely and uniformly as n — > oo. 

Furthermore, by law of large numbers, cr 2 (Y( n ), xW) — ^ 1 
almost surely as n -4 oo for any random sequence {YW}, 
and thus, 

pr {^ ^)xw) h Pf{xj>t} 

almost surely and uniformly as n — > oo. Using the fact that 
Pr{x 2 > t} = 2Q(i) for t > completes the proof. 

B. Proof of Lemma 5

The main challenge in this proof is that, for each problem of
size $n$, the variables $\{Z_{n,i}^2 : i \in s\}$ are neither independent nor
identically distributed. To proceed, we first show that $D_n^+(t)$
converges in expectation to the limit $D^+(t)$, and then show
that $D_n^+(t)$ also converges in probability.

Convergence in expectation: Observe that the expectation 
of D+(t) depends only on the marginal distributions of the 
elements {Z 2 . : i G s} and is given by 

E[ J D+(t)] = ^Pr{^, i <t}. 

To further characterize the probabilities $\Pr\{Z_{n,i}^2 < t\}$, we
decompose each column of the sampling matrix as $A_i = \|A_i\| U_i$
where $U_i = A_i / \|A_i\|$ is a random unit vector
independent of $\|A_i\|$. We can then write

$$A_i^T Y = \|A_i\|^2 x_i + A_i^T (Y - A_i x_i) = \|A_i\|^2 x_i + \|A_i\|\, U_i^T (Y - A_i x_i) \tag{39}$$

where the random variable $U_i^T (Y - A_i x_i)$ is independent of
$\|A_i\|$ and has a Gaussian distribution with zero mean and
variance $1 + \|x\|^2/n - x_i^2/n$.
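
The decomposition (39) and the stated variance can be checked by simulation. The sketch below assumes i.i.d. $N(0, 1/n)$ matrix entries and unit-variance noise, and uses arbitrary illustrative values for $m$, $n$ and $x$. It verifies the algebraic identity on every draw and compares the empirical variance of $U_i^T(Y - A_i x_i)$ with $1 + \|x\|^2/n - x_i^2/n$.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, i = 30, 50, 0
x = np.zeros(n)
x[:5] = [1.5, -2.0, 0.5, 1.0, -1.0]

samples = []
max_gap = 0.0
for _ in range(3000):
    A = rng.standard_normal((m, n)) / np.sqrt(n)    # entries N(0, 1/n)
    Y = A @ x + rng.standard_normal(m)
    a_i = A[:, i]
    norm = np.linalg.norm(a_i)
    u_i = a_i / norm                                # unit vector U_i
    inner = u_i @ (Y - a_i * x[i])                  # U_i^T (Y - A_i x_i)
    lhs = a_i @ Y
    rhs = norm ** 2 * x[i] + norm * inner           # right hand side of (39)
    max_gap = max(max_gap, abs(lhs - rhs))
    samples.append(inner)

empirical_var = float(np.var(samples))
predicted_var = 1.0 + (x @ x) / n - x[i] ** 2 / n
print("max identity gap:", max_gap)
print("empirical variance:", empirical_var, "predicted:", predicted_var)
```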

Since $n\|A_i\|^2$ is a chi-square random variable with $m$
degrees of freedom, the chi-square large deviation results in
Lemma 8 in Appendix F show that

$$\lim_{n \to \infty} \max_{1 \le i \le n} \left| \|A_i^{(n)}\|^2 - \rho \right| = 0 \tag{40}$$

almost surely for any sequence of sampling matrices $A^{(n)}$.

Combining (39) and (40) shows that the marginal
distributions of the elements $\{Z_{n,i} : i \in s\}$ are asymptotically
Gaussian, or more explicitly,



lim max sup 

n— >oo i£s i 



Vr{Z nti >t) -()(*—!■ 







where the mean and variance are given by 

l + |jx|| 2 /n— x\ I n 



l + ||x||Vn* 



and <i = 



*7r 



Hence, by the definition of G(-, •), the asymptotic expectation 
of D+(t) can be expressed as 



lim E[D+(t)] = lim 



/'ts 



(41) 



To evaluate the right hand side of (41), we use a truncation
argument to show that the effect of any "large" nonzero
elements $x_i$ is negligible. More precisely, we define the set
of indices $I_n = \{i \in s : x_i^2 < \sqrt{n}\}$ and observe that since the
empirical distribution $F_n$ of $\{x_i : i \in s\}$ converges uniformly
to $F$ for any sequence $\{x^{(n)}\} \in \mathcal{X}(\Omega, F)$,



lim max \ a n i — 1 =0 



lim max \fi r , 



i+p(n,F) 



0. 



and 



2 

lim - G[ — 

n— >oo K \0~ a 



1 

< lim -ry" 1 ^ > y/n) 



1=1 



= lim 1 - F(n x / 4 ) + ^(-n 1 / 4 ) = 

n— >oo 

where we use the fact that < G(-, ■) < 1. 

Starting from Equation (41) and using the above
observations gives



lim E[D+(t)\ = lim \ £ G( T 



■*) 



lim 



lim 

n— f oo 



G( 



G (l+pkF)^)^) 
u 2 ,t)dF(w), 



1 + P(0,F) 



and thus we have shown that $D_n^+(t)$ converges in expectation
to the desired limit.

Convergence in probability: We now show that the
variance of $D_n^+(t)$ tends to zero uniformly as $n \to \infty$. Using
this result, the desired convergence in probability follows
directly from Chebyshev's inequality.

To begin, we define the set $I_n = \{i \in s : x_i^2 < \sqrt{n}\}$ and
the quantity

&ht) := I Pr(^ < £; ^2 < t ) 



Pr(^.<t)Pr«.<t)| 



Since $|I_n|/k \to 1$ for any sequence $\{x^{(n)}\} \in \mathcal{X}(\Omega, F)$, and
since $0 \le \delta_{i,j}^{(n)} \le 1$, we can bound the variance of $D_n^+(t)$ as

lim sup Var(£+ (t)) < lim sup i ^ ^ 6<$ (t) 



i£s j£s 



= lim sup -j ^ ^(t) 

n— ^oo . . . 

< lim sup max ^".(t). 



(42) 



The next step is to show that the right hand side of (42)
converges to zero uniformly in $t$. For a given $n$, let $p_i$ denote
the probability measure on $Z_{n,i}$ and $p_{i,j}$ the joint probability
measure on $(Z_{n,i}, Z_{n,j})$. Then, for all $t$,

$$\delta_{i,j}^{(n)}(t) \le \|p_{i,j} - p_i p_j\|_{TV}$$

where \\pij — PiPjWrv denotes the total variation distance 
between pij and PiPj. Using Pinsker's inequality P4l . the 
total variation distance can be upper bounded by the mutual 
information between Z n ^ and Z n> j as 

\\Pi,j -PiPjWrv ^ 2I(Z nii ; Z nJ ). 

To bound the mutual information $I(Z_{n,i}; Z_{n,j})$ we may
expand the information $I(Z_{n,i}, A_i; Z_{n,j}, A_j)$ two different
ways using the chain rule for mutual information to obtain

$$I(Z_{n,i}; Z_{n,j}) \le I(Z_{n,i}; Z_{n,j} \mid A_i, A_j) + I(Z_{n,i}; A_j \mid A_i) + I(A_i; Z_{n,j} \mid A_j) + I(A_i; A_j).$$

In the following, we show that each of the terms on the right
hand side of the above expression tends to zero as $n \to \infty$.

First, by the independence of the columns $A_i$ and $A_j$ we
have $I(A_i; A_j) = 0$ for all $n$.

Next, we consider the term $I(Z_{n,i}; A_j \mid A_i)$. Observe that
the random variable $A_i^T Y$ can be decomposed as

$$A_i^T Y = A_i^T (Y - A_j x_j) + A_i^T A_j x_j$$

where $A_i^T (Y - A_j x_j)$ is independent of $A_j$. Therefore, since
$Z_{n,i}$ is proportional to $A_i^T Y$, we can write

$$I(Z_{n,i}; A_j \mid A_i) = I(A_i^T Y;\, A_i^T A_j x_j \mid A_i).$$

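Conditioned on any realization $A_i = a_i$, the variables $a_i^T Y$
and $a_i^T A_j x_j$ are jointly Gaussian with covariance

$$\Sigma = \|a_i\|^2 \begin{bmatrix} 1 + \|x\|^2/n - x_i^2/n & x_j^2/n \\ x_j^2/n & x_j^2/n \end{bmatrix}.$$

From the definition of $I_n$, we conclude that

$$\limsup_{n \to \infty} \max_{i,j \in I_n} I(Z_{n,i}; A_j \mid A_i)
= \limsup_{n \to \infty} \max_{i,j \in I_n} \frac{1}{2} \log\left( \frac{1 + \|x\|^2/n - x_i^2/n}{1 + \|x\|^2/n - (x_i^2 + x_j^2)/n} \right)
= 0.$$

A symmetric argument shows that the same limit applies to
the term $I(A_i; Z_{n,j} \mid A_j)$.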

Lastly, we consider the term $I(Z_{n,i}; Z_{n,j} \mid A_i, A_j)$. With a
bit of work, it can be shown that for any realizations $A_i = a_i$,
$A_j = a_j$, the variables $Z_{n,i}$ and $Z_{n,j}$ are jointly Gaussian
with covariance

l + ||x|| 2 /n-(x?+^)/n 



E = 



(m/n)(l + ||x 2 ||/n) 



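Using the fact that $U = \left( \frac{a_i^T a_j}{\|a_i\| \|a_j\|} \right)^2$ has a Beta$(1, n-1)$
distribution, it follows that

$$I(Z_{n,i}; Z_{n,j} \mid A_i, A_j) = \frac{1}{2} E\left[ \log \frac{1}{1 - U} \right] \le \frac{1}{2} E\left[ \frac{U}{1 - U} \right] = \frac{1}{2(n-2)}.$$

Hence,

$$\limsup_{n \to \infty} \max_{i,j \in I_n} I(Z_{n,i}; Z_{n,j} \mid A_i, A_j) = 0$$

which completes the proof.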

C. Proof of Corollary 2

From Theorem 2, a sampling rate distortion pair $(\rho, \alpha)$ is
achievable if

$$\Pr\left\{ \left| W + \sqrt{\tfrac{\rho}{1 + P(\Omega, F)}}\, X \right| < t \right\} \le \alpha \qquad (43)$$



where W has a Gaussian distribution with zero mean and unit 
variance, X has distribution F and is independent of W, and 
t = Q 1 { 2 (i^n) )- Observe that for any x > 0, we can write 



{ 



Pr W 



i+p(n,F) 



X < t 



} 



<Pr{|X| >x } + p T {\W + y J=fc^X\ <t\\X\<x}. 
Furthermore, if we let x = ^/P(l, F a / 2 ), then 

x 2 = (2/ a) f ' F x ]{p)dp < F~l{a/2) 

where F^l denotes the quantile function of X 2 , and thus 

Pr{|X| >x}< a/2. (44) 
Additionally, if p satisfies (fT2l . then 

p r{K+^T+pfe*l<*|m<*} 
< p r{w<t-J^=x} 

= T+P{fi.F) x ~ *) 

< a/2. (45) 

Combining (44) and (45) shows that the left hand side of (43)
is less than or equal to $\alpha$, which completes the proof.

Appendix C
Proof of Theorem 3

Let $(\rho, \alpha)$ be a sampling rate distortion pair that is achievable
for the source $\mathcal{X}(\Omega, F)$ and estimator $\hat{s}$, and let
$\{A^{(n)} \in \mathbb{R}^{\lceil \rho n \rceil \times n}\}$ be the sampling matrix sequence that achieves
this rate. Furthermore, let $\{x^{(n)}\} \in \mathcal{X}(\Omega, F)$ be an arbitrary
sequence of vectors with sparsity $k_n$, and let $\tilde{k}_n$ be a
sequence of integers that obeys $\lim_{n \to \infty} |\tilde{k}_n - k_n|/n = 0$
and $\limsup_{n \to \infty} (k_n - \tilde{k}_n) \le 0$. In the following, we show
that the asymptotic performance of the estimator $\hat{s}(y, A, \tilde{k}_n)$
corresponding to the sequence $\tilde{k}_n$ is equal to that of the
estimator $\hat{s}(y, A, k_n)$ corresponding to the true sparsity $k_n$.
For notational simplicity, we will often make the dependence
on the problem size $n$ implicit and write $x$, $A$, and $k$ instead
of $x^{(n)}$, $A^{(n)}$, and $k_n$.

To begin, let $\{\tilde{x}^{(n)}\} \in \mathcal{X}(\Omega, F)$ be a related sequence of
vectors with sparsity $\tilde{k}$ whose sparsity pattern $\tilde{s}$ obeys
$|s \cap \tilde{s}| = \min(k, \tilde{k})$ and whose nonzero values obey
$\|x - \tilde{x}\| \to 0$ as $n \to \infty$. By the definition of $\mathcal{X}(\Omega, F)$, it can
be verified that such a sequence is guaranteed to exist. Also, for
each integer $n$, let $Y = Ax + W$ and $\tilde{Y} = A\tilde{x} + W$ be the samples
of $x$ and $\tilde{x}$ corresponding to the common sampling matrix $A$ and
noise vector $W$.

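Now, using the triangle inequality gives

$$d(s, \hat{s}(Y, A, \tilde{k})) \le d(s, \tilde{s}) + d(\tilde{s}, \hat{s}(Y, A, \tilde{k})),$$

where, by assumption, the first term on the right obeys

$$d(s, \tilde{s}) = \max(1 - k/\tilde{k},\; 1 - \tilde{k}/k) \to 0 \quad \text{as } n \to \infty.$$

To deal with the second term, we note that

$$d(\tilde{s}, \hat{s}(Y, A, \tilde{k})) = d(\tilde{s}, \hat{s}(\tilde{Y} + A(x - \tilde{x}), A, \tilde{k})), \qquad (46)$$

and use the following lemma.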

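Lemma 6. Let $Y = Ax + W$ where $x \in \mathbb{R}^n$ and $A \in \mathbb{R}^{m \times n}$
are fixed and $W \in \mathbb{R}^m$ is a standard Gaussian vector.
Furthermore, let $\hat{s} : \mathbb{R}^m \mapsto S^n$ be a sparsity pattern estimator.
If $\Pr\{d(s, \hat{s}(Y)) > \alpha\} \le \epsilon$, then

$$\Pr\{d(s, \hat{s}(Y + z)) > \alpha\} \le \epsilon + \|z\| \qquad (47)$$

for all $z \in \mathbb{R}^m$.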

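Proof: For simplicity, suppose that $\hat{s}$ is a deterministic
function; the extension of the proof to handle random estimators
is straightforward. Define the set

$$\mathcal{W} = \{w \in \mathbb{R}^m : d(s, \hat{s}(Ax + w)) > \alpha\},$$

and observe that $\Pr\{W \in \mathcal{W}\} \le \epsilon$. Thus, for any $z \in \mathbb{R}^m$,

$$\begin{aligned}
\Pr\{d(s, \hat{s}(Y + z)) > \alpha\} &= \Pr\{W + z \in \mathcal{W}\} \\
&\le \epsilon + |\Pr\{W + z \in \mathcal{W}\} - \Pr\{W \in \mathcal{W}\}| \\
&\le \epsilon + \sup_{\mathcal{A}} |\Pr\{W + z \in \mathcal{A}\} - \Pr\{W \in \mathcal{A}\}| \\
&\le \epsilon + \sqrt{2 D_{KL}(W + z \,\|\, W)} \qquad (48)
\end{aligned}$$

where (48) follows from Pinsker's inequality (see for example
[44]) and $D_{KL}(\cdot \| \cdot)$ is the Kullback-Leibler divergence. Since $W$
is a standard Gaussian vector, $D_{KL}(W + z \| W) = \|z\|^2/2$, so the
right hand side of (48) equals $\epsilon + \|z\|$. ■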
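The last two steps of the proof can be checked numerically. The following Python sketch is an illustration only (the function names are hypothetical and not part of the paper). It uses two standard facts about a standard Gaussian vector W: the supremum over events of |Pr{W + z in A} - Pr{W in A}| is attained by a half-space and equals 2Φ(||z||/2) - 1, and D_KL(W + z || W) = ||z||^2 / 2, so the Pinsker-type bound in (48) reduces to ||z||.

```python
import math


def std_normal_cdf(u: float) -> float:
    """CDF of a scalar standard Gaussian."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))


def shift_total_variation(z) -> float:
    """sup over events A of |Pr{W + z in A} - Pr{W in A}| for W ~ N(0, I).

    The supremum is attained by a half-space orthogonal to z, so the value
    depends on z only through its Euclidean norm.
    """
    norm = math.sqrt(sum(v * v for v in z))
    return 2.0 * std_normal_cdf(norm / 2.0) - 1.0


def pinsker_bound(z) -> float:
    """sqrt(2 * D_KL(W + z || W)) with D_KL = ||z||^2 / 2, i.e. ||z||."""
    kl = sum(v * v for v in z) / 2.0
    return math.sqrt(2.0 * kl)


if __name__ == "__main__":
    # Illustrative shift vectors; any real vector works.
    for z in ([0.1, 0.0], [0.5, 0.5], [1.0, 2.0, 2.0]):
        print(z, shift_total_variation(z), pinsker_bound(z))
```

For every shift the exact supremum stays below the bound, and for small ||z|| it is roughly 0.4 ||z||, which shows how much slack the additive term in (47) carries.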
Combining Lemma 6 and (46) gives

$$\Pr\{d(\tilde{s}, \hat{s}(Y, A, \tilde{k})) > \alpha\} \le \Pr\{d(\tilde{s}, \hat{s}(\tilde{Y}, A, \tilde{k})) > \alpha\} + E[\|A(x - \tilde{x})\|] \qquad (49)$$

where the expectation is taken with respect to the random
matrix $A$. By the achievability of $(\rho, \alpha)$, the first term on the
right hand side of (49) tends to zero as $n \to \infty$. Additionally,
since the elements of the vector $A(x - \tilde{x})$ are i.i.d. Gaussian
with zero mean and variance $\|x - \tilde{x}\|^2/n$, it follows from
Jensen's inequality that

$$E[\|A(x - \tilde{x})\|] \le \sqrt{\frac{m}{n}}\, \|x - \tilde{x}\|$$

and thus the second term tends to zero as $n \to \infty$ by the
assumptions on $\tilde{x}$. Hence, we have shown that for any $\epsilon > 0$,

$$\Pr\{d(s, \hat{s}(Y, A, \tilde{k})) > \alpha + \epsilon\} \to 0 \quad \text{as } n \to \infty$$

which proves our desired result.



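Appendix D
Proof of Theorem 4

To begin, we note that the sampling rate associated with the
sequence $\{A^{(n)}\}$ is given by

$$\lim_{n \to \infty} \frac{\lceil \rho_1 \lceil \lambda n \rceil \rceil + \lceil \rho_2 (n - \lceil \lambda n \rceil) \rceil}{n} = \lambda \rho_1 + (1 - \lambda) \rho_2.$$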

Also, since 

E||A(")||| = E||Ap nl) ||^ +Ej|A^ rA ' nl) ||2„ 

each matrix A'"' obeys the scaling constraint (0. 

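Next, we observe that for each integer $n$, the vector of
samples $Y$ can be expressed as

$$\begin{bmatrix} Y_1 \\ Y_2 \end{bmatrix} = \begin{bmatrix} A_1^{(\lceil \lambda n \rceil)} & 0 \\ 0 & A_2^{(n - \lceil \lambda n \rceil)} \end{bmatrix} \begin{bmatrix} X_1 \\ X_2 \end{bmatrix} + \begin{bmatrix} W_1 \\ W_2 \end{bmatrix}$$

where $X_1 \in \mathbb{R}^{\lceil \lambda n \rceil}$ and $X_2 \in \mathbb{R}^{n - \lceil \lambda n \rceil}$ correspond to a
random permutation and partition of the elements in $x^{(n)}$. Note
that the sparsity patterns $S_1$ and $S_2$ of $X_1$ and $X_2$ are random
sets that obey $E|S_1| = \Omega \lceil \lambda \cdot n \rceil$ and $E|S_2| = \Omega (n - \lceil \lambda \cdot n \rceil)$
respectively. Let $\hat{S}_1$ and $\hat{S}_2$ be estimates of these sparsity
patterns that are made separately using the samples $Y_1$ and
$Y_2$ and expected sizes (since the true sizes are unknown), and
let $\hat{S}$ be the estimate of the original sparsity pattern $s$ based on
the union of $\hat{S}_1$ and $\hat{S}_2$ mapped back into the original index
set of $x$.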

Since $S_1$ and $S_2$ correspond to a disjoint partition of $s$, it
can be verified that

$$d(s, \hat{S}) \le \Lambda \cdot d(S_1, \hat{S}_1) + (1 - \Lambda) \cdot d(S_2, \hat{S}_2) \qquad (50)$$

where

$$\Lambda = \frac{\max(|S_1|, |\hat{S}_1|)}{\max(|S_1|, |\hat{S}_1|) + \max(|S_2|, |\hat{S}_2|)}.$$

Also, since the permutation is independent of $x^{(n)}$, both
$\{X_1^{(n)}\}$ and $\{X_2^{(n)}\}$ are elements of the vector source $\mathcal{X}(\Omega, F)$
almost surely. Thus, $\lim_{n \to \infty} \Lambda = \lambda$ almost surely and, by
Theorem 3, the distortions on the right hand side of (50) will
be less than $\alpha_1$ and $\alpha_2$, respectively, with probability tending
to one. Putting everything together shows the distortion
$\lambda \alpha_1 + (1 - \lambda) \alpha_2$ is achievable, which concludes the proof.
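The splitting inequality (50) can be illustrated with a short Python sketch. The distortion measure is defined earlier in the paper and is not restated in this appendix, so the code assumes the form d(s, ŝ) = 1 - |s ∩ ŝ| / max(|s|, |ŝ|), which agrees with the expression max(1 - k/k̃, 1 - k̃/k) used in Appendix C when the overlap is min(k, k̃). The function names and the example index sets are hypothetical.

```python
def distortion(true_set: set, est_set: set) -> float:
    """ASSUMED distortion: 1 - |s ∩ ŝ| / max(|s|, |ŝ|); 0 if both are empty."""
    largest = max(len(true_set), len(est_set))
    if largest == 0:
        return 0.0
    return 1.0 - len(true_set & est_set) / largest


def split_bound(s1: set, s1_hat: set, s2: set, s2_hat: set):
    """Return (left, right) sides of (50) for two disjoint index blocks.

    s1, s1_hat must live in one block of indices and s2, s2_hat in another,
    so that the union of the block estimates is the overall estimate.
    At least one of the four sets must be nonempty.
    """
    m1 = max(len(s1), len(s1_hat))
    m2 = max(len(s2), len(s2_hat))
    weight = m1 / (m1 + m2)  # the random weight in (50)
    left = distortion(s1 | s2, s1_hat | s2_hat)
    right = weight * distortion(s1, s1_hat) + (1.0 - weight) * distortion(s2, s2_hat)
    return left, right


if __name__ == "__main__":
    # Block 1 uses indices below 50, block 2 uses indices 50 and above.
    left, right = split_bound({0, 1, 2}, {1, 2, 3}, {50, 51}, {50})
    print(left, right)
```

Under the assumed distortion the inequality holds because the overlap of the union is the sum of the block overlaps, while the larger of the two total sizes never exceeds the sum of the per-block maxima.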

Appendix E 
Proofs of Scaling Bounds 

A. Proof of Proposition 1

This proof follows from the upper bound (Q given in 
Theorem Q] We first consider the denominator. Using Lemma 
Q] Lemma|9]in Appendix[F] and the concavity of the logarithm, 
shows that there exists some constant C\ > such that 

c(i + p(pn,F p )) >Ciio g (i + /3 4L+2 p 2 ). 

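Next we consider the numerator. Since the binary entropy
function $H(\cdot)$ is concave and increasing on the interval
$[0, 1/2]$, we can write

$$\Omega H(\beta) + (1 - \Omega) H\left( \frac{\beta \Omega}{1 - \Omega} \right) \le 2 H(\beta \Omega) \le 4 \beta \Omega \log\left( \frac{1}{\beta \Omega} \right).$$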
Thus far, we have shown that



p(a) < f2 + — • max 



Ci a</3<l log (1 + /3 iL + 2 P 2 ) ' 

To conclude the proof, we use Lemma 10 in Appendix F, which
shows that it is sufficient to take the maximum over only the
endpoints $\beta = \alpha$ and $\beta = 1$.

B. Proof of Proposition [2] 

This proof follows directly from the upper bound (fT~2T > given 
in Corollary Q] Using Lemma Q] shows that there exists some 
constant C\ < oo such that 



1 + P(Q,F) 2L fl 
PV,F a/2 ) ~ Cl ' a 



P 



(51) 



Furthermore, using the bound Q 1 (p) < y— 2 log(2p) for 
< v < 1/2 (see EU pp. 53]) gives 

'1 - fT 



2(1-J2) ■ 



< 8 log 



(52) 



Combining (51) and (52) completes the proof.



Appendix F 
Technical Lemmas 

Lemma 7 ([44, pp. 151]). If $k/n \to p$ as $n \to \infty$ for some
$0 < p < 1$ then,

$$\lim_{n \to \infty} \frac{1}{n} \log \binom{n}{k} = H(p)$$

where $H(p)$ is the binary entropy function.
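The convergence in Lemma 7 is easy to observe numerically. The sketch below is an illustration only: it assumes natural logarithms on both sides (any base works as long as it is used consistently), evaluates the log-binomial through `math.lgamma` to avoid huge integers, and uses hypothetical function names.

```python
import math


def binary_entropy(p: float) -> float:
    """Binary entropy H(p) in nats."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


def log_binomial_rate(n: int, k: int) -> float:
    """(1/n) * log C(n, k), computed through log-gamma functions."""
    log_binom = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_binom / n


if __name__ == "__main__":
    p = 0.3
    for n in (100, 10_000, 1_000_000):
        k = round(p * n)
        print(n, log_binomial_rate(n, k), binary_entropy(p))
```

The normalized log-binomial approaches H(p) from below, with a gap that shrinks on the order of log(n)/n.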

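Lemma 8. Let $\chi^2_d$ be a chi-square random variable with $d$
degrees of freedom. For any $\epsilon > 0$,

$$\Pr\{\chi^2_d > (1 + \epsilon) d\} \le \exp(-d \cdot \epsilon^2/4), \qquad (53)$$

$$\Pr\left\{\chi^2_d < \left(\tfrac{1}{1+\epsilon}\right) d\right\} \le \exp(-d \cdot \mathcal{L}(1 + \epsilon)) \qquad (54)$$

where $\mathcal{L}(\cdot)$ is defined in (8).

Proof: The proof of (53) follows directly from [46, pp.
1325]. To prove (54), we apply a Chernoff bound (see [44,
pp. 318]) to obtain

$$\Pr\left\{\chi^2_d < \left(\tfrac{1}{1+\epsilon}\right) d\right\} \le \exp\left(\mu \left(\tfrac{1}{1+\epsilon}\right) d\right) E\left[\exp(-\mu \chi^2_d)\right] = \exp(-d \cdot \Lambda(\mu))$$

for any $\mu > 0$ where $\Lambda(\mu) = \frac{1}{2} \log(1 + 2\mu) - \frac{\mu}{1+\epsilon}$. By
differentiating, it can be shown that the maximum of $\Lambda(\mu)$ is
attained at $\mu = \epsilon/2$. Noting that $\Lambda(\epsilon/2) = \mathcal{L}(1 + \epsilon)$ completes
the proof. ■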
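The Chernoff argument for (54) can be checked numerically. The sketch below makes two assumptions that are not taken from the paper: the function defined in (8) is given the form ½[log x + 1/x − 1], which is inferred from the identity Λ(ε/2) = ℒ(1 + ε) at the end of the proof rather than read from (8) itself, and only even d is handled so that the chi-square CDF has a closed form. All function names are hypothetical.

```python
import math


def script_l(x: float) -> float:
    """ASSUMED form of the function in (8): 0.5 * (log x + 1/x - 1)."""
    return 0.5 * (math.log(x) + 1.0 / x - 1.0)


def chernoff_exponent(mu: float, eps: float) -> float:
    """Exponent Lambda(mu) = 0.5 * log(1 + 2 mu) - mu / (1 + eps)."""
    return 0.5 * math.log(1.0 + 2.0 * mu) - mu / (1.0 + eps)


def chi2_cdf_even(x: float, d: int) -> float:
    """Exact chi-square CDF for even d:
    1 - exp(-x/2) * sum_{j < d/2} (x/2)^j / j!
    """
    if d % 2 != 0:
        raise ValueError("closed form used here requires even d")
    half = x / 2.0
    term, total = 1.0, 0.0
    for j in range(d // 2):
        total += term
        term *= half / (j + 1)
    return 1.0 - math.exp(-half) * total


def lower_tail_and_bound(d: int, eps: float):
    """Return (Pr{chi2_d < d/(1+eps)}, exp(-d * L(1+eps)))."""
    tail = chi2_cdf_even(d / (1.0 + eps), d)
    bound = math.exp(-d * script_l(1.0 + eps))
    return tail, bound


if __name__ == "__main__":
    for d in (2, 10, 40):
        for eps in (0.1, 1.0, 5.0):
            print(d, eps, *lower_tail_and_bound(d, eps))
```

Because μ = ε/2 maximizes the concave exponent, evaluating the exponent at any other μ > 0 gives a weaker but still valid bound.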

Lemma 9. For any $x > 0$,

$$\mathcal{L}(1 + x) \ge \tfrac{1}{4} \log\left(1 + \tfrac{1}{8} x^2\right),$$

where $\mathcal{L}(\cdot)$ is defined in (8).

Proof: It is convenient to prove the slightly stronger result

$$\mathcal{L}(1 + x) \ge \tfrac{1}{4} \log\left(1 + e^{-2} x^2\right).$$

Observe that for any $a > 0$, the above expression can be
expressed equivalently as

$$-\log\left( \frac{e^{a} (1 + e^{-2} x^2)}{(1 + x)^2} \right) + a - \frac{2x}{1 + x} \ge 0.$$

Using the bound $\log(1 + x) \le x$, it can be shown that the
above statement is true if

$$(1 + a - e^{a}) + 2ax - (1 - a + e^{a-2}) x^2 \ge 0.$$

Evaluating the above condition with $a = 2$ shows that the
bound holds for any $x \ge 5$ and evaluating with $a = 1$ shows
that the bound holds for any $.4 \le x \le 5$.

For the case $x < .4$ we use a different bounding technique.
Using the bounds $x(1 - x/2) \le \log(1 + x) \le x$ it can be
shown that

$$\mathcal{L}(1 + x) \ge \frac{1}{2} \cdot \frac{x^2 (1 - x)}{2(1 + x)} \ge \tfrac{1}{4} e^{-2} x^2 \ge \tfrac{1}{4} \log(1 + e^{-2} x^2).$$

Hence, we have shown that the bound holds for all $x > 0$.
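A grid evaluation gives a quick sanity check of Lemma 9 and of the stronger intermediate bound. As before, the sketch assumes the function in (8) has the form ½[log x + 1/x − 1], and it assumes the constants 1/4, 1/8 and e^{-2} as read from the damaged statement above; the function names are hypothetical and the check is numerical, not a proof.

```python
import math


def script_l(x: float) -> float:
    """ASSUMED form of the function in (8): 0.5 * (log x + 1/x - 1)."""
    return 0.5 * (math.log(x) + 1.0 / x - 1.0)


def lemma9_margins(x: float):
    """Return (L(1+x) - strong bound, strong bound - lemma bound).

    strong bound: 0.25 * log(1 + e^{-2} x^2)
    lemma bound:  0.25 * log(1 + x^2 / 8)
    Both margins should be positive for x > 0 if the reading is right.
    """
    strong = 0.25 * math.log1p(math.exp(-2.0) * x * x)
    weak = 0.25 * math.log1p(x * x / 8.0)
    return script_l(1.0 + x) - strong, strong - weak


if __name__ == "__main__":
    # Log-spaced grid from 1e-3 to 1e4.
    grid = [10.0 ** (e / 10.0) for e in range(-30, 41)]
    print(min(lemma9_margins(x)[0] for x in grid))
    print(min(lemma9_margins(x)[1] for x in grid))
```

The first margin decays like 1/x for large x, which indicates that the constant e^{-2} in the stronger bound cannot be increased.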
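Lemma 10. Given any $0 < \gamma < \infty$ and $1 \le b < \infty$, let

$$\theta(x) = \frac{-x \log(x)}{\log(1 + \gamma x^b)}.$$

Then, for any $0 < \alpha < 1/8$,

$$\max_{\alpha \le x \le 1/8} \theta(x) \le 4 \max\{\theta(\alpha), \theta(1/8)\}.$$

Proof: Let $x^* = (8/\gamma)^{1/b}$, $x_1 = \min\{x^*, 1/8\}$, and
$x_2 = \max\{\alpha, x^*\}$. Then, observe that

$$\max_{\alpha \le x \le 1/8} \theta(x) = \max\left\{ \max_{\alpha \le x \le x_1} \theta(x),\; \max_{x_2 \le x \le 1/8} \theta(x) \right\}.$$

Furthermore,

$$\begin{aligned}
\max_{\alpha \le x \le x_1} \theta(x)
&= \max_{\alpha \le x \le x_1} \left( \frac{-x \log(x)}{\gamma x^b} \cdot \frac{\gamma x^b}{\log(1 + \gamma x^b)} \right) \\
&\le \left( \max_{\alpha \le x \le x_1} \frac{-x \log(x)}{\gamma x^b} \right) \left( \max_{\alpha \le x \le x_1} \frac{\gamma x^b}{\log(1 + \gamma x^b)} \right) \\
&= \left( \theta(\alpha) \frac{\log(1 + \gamma \alpha^b)}{\gamma \alpha^b} \right) \left( \frac{\gamma x_1^b}{\log(1 + \gamma x_1^b)} \right) \\
&\le 4 \theta(\alpha).
\end{aligned}$$

Also,

$$\begin{aligned}
\max_{x_2 \le x \le 1/8} \theta(x)
&= \max_{x_2 \le x \le 1/8} \left( \frac{-x \log(x)}{\log(\gamma x^b)} \cdot \frac{\log(\gamma x^b)}{\log(1 + \gamma x^b)} \right) \\
&\le \left( \max_{x_2 \le x \le 1/8} \frac{-x \log(x)}{\log(\gamma x^b)} \right) \left( \max_{x_2 \le x \le 1/8} \frac{\log(\gamma x^b)}{\log(1 + \gamma x^b)} \right) \\
&= \left( \theta(1/8) \frac{\log(1 + \gamma (1/8)^b)}{\log(\gamma (1/8)^b)} \right) \left( \frac{\log(\gamma (1/8)^b)}{\log(1 + \gamma (1/8)^b)} \right) \\
&= \theta(1/8)
\end{aligned}$$

where we have used the fact (which can be verified using
differentiation) that the function $-x \log(x)/\log(\gamma x^b)$ is strictly
increasing over the interval $[x_2, 1/8]$. ■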
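The endpoint bound of Lemma 10 can be probed on a grid. This Python sketch is an illustration under the reading θ(x) = −x log(x) / log(1 + γ x^b) with 0 < γ < ∞, b ≥ 1 and 0 < α < 1/8; the function names and the parameter choices in the demo are hypothetical, and a grid search is only a spot check, not a proof.

```python
import math


def theta(x: float, gamma: float, b: float) -> float:
    """theta(x) = -x log(x) / log(1 + gamma * x^b) for 0 < x < 1."""
    return -x * math.log(x) / math.log1p(gamma * x ** b)


def grid_max_and_bound(alpha: float, gamma: float, b: float, num: int = 2000):
    """Return (max of theta on a log-spaced grid over [alpha, 1/8],
    4 * max(theta(alpha), theta(1/8)))."""
    upper = 0.125
    # Log spacing resolves the behavior near a small left endpoint.
    grid = [alpha * (upper / alpha) ** (i / (num - 1)) for i in range(num)]
    grid_max = max(theta(x, gamma, b) for x in grid)
    bound = 4.0 * max(theta(alpha, gamma, b), theta(upper, gamma, b))
    return grid_max, bound


if __name__ == "__main__":
    for alpha, gamma, b in ((1e-3, 1.0, 1.0), (1e-2, 100.0, 2.0), (1e-4, 1e6, 3.0)):
        print(alpha, gamma, b, *grid_max_and_bound(alpha, gamma, b))
```

In these examples the maximum sits at one of the two endpoints, so the factor 4 appears to be slack introduced by the proof technique rather than something the function attains.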